Reference Database

ABSTRACT

Data acquisition and cataloging are used to classify polypeptides into a reference index or database. The database can be used to identify previously unidentified samples. New polypeptides are characterized and added to the database.

This application is a continuation-in-part of U.S. Ser. No. 654,133 filed Sep. 1, 2000, the contents of which are incorporated in their entirety.

FIELD OF THE INVENTION

The invention relates to methods and means for obtaining, storing and using an index or catalog of proteins. The catalog can be specific for, for example, an organelle, cell, tissue, organ, organism or population.

BACKGROUND OF THE INVENTION

Proteins are the working parts of living cells. With the near completion of the Human Genome Project there is now a need for an integrated system and program for obtaining, organizing, searching and using experimentally derived global information on the protein composition of cells, and on how that composition varies in development, in disease, and in response to drugs, toxic agents and other experimental variables.

The human genome is estimated to code for up to 100,000 different proteins. Most if not all are post-translationally modified, and/or are transported from the site of synthesis to the site of function. Many are elements of signaling or communication pathways. The protein composition of cells changes in an organized manner during development, and many cell-specific proteins are known.

Methods for separating or identifying proteins by immunochemical means are widely used and well understood. However, no large-scale systematic means for producing protein-specific antibodies has been described, hence a library of antibodies to match the ever increasing number of isolated proteins or the genomic data from the Human Genome Project does not exist.

The final proof that a given protein is present in a given cell type, and in a specific organelle of that cell type, can be provided by immunochemical studies on carefully prepared cell and tissue sections. Many instances of such studies have been reported; however, systematic use of such procedures to confirm the localization of multiple proteins, much less large numbers of proteins, has not been described. Such studies cannot proceed in the absence of a library of well-characterized antibodies to a library of specific proteins.

While many of the elements of the multi-dimensional Human Genome Project now exist, at least in part, the extension of that information to systematic large-scale studies requires innovation, automation and integration. Tissue and protein samples and fractions rapidly degrade; hence, it is not feasible to organize a project aimed at characterizing all of the proteins in a fashion similar to the Human Genome Project based on cooperative efforts at many sites. To further handle perishable samples, automation is best developed in intimate contact with an existing operating system. In addition, the elements of an integrated system must match each other in throughput and in time requirements. For example, cell fractionation of sets of tissues obtained at the same time must match the requirements of the next step in the fractionation process. Thus, the hierarchical disassembly of a freshly obtained tissue to cells, subcellular fractions, separation and analysis at the protein level, and data acquisition and analysis must match and must include quality control elements so that key steps may be repeated while the samples are still in good condition and available.

To organize, search and experimentally manipulate information relating to such a large number of functional entities will require a theoretical framework in which new knowledge can be organized, means for obtaining the wide range of data required, and means for doing the experimental studies required to test new hypotheses. Such means did not exist previously in an integrated or integratable form.

The human body is composed of approximately 252 different cell types, all descendant through different intermediate cells from the three germ layers, and ultimately from a single fertilized human egg. While all diploid cells contain the same genetic information, different genes are expressed in different cell types and at different times during development and during the cell cycle. A protein gene product expressed in several cell types may differ in abundance. In addition, most, if not all, proteins are post-translationally modified. Further, proteins are synthesized in one set of structures (ribosomes), but target themselves into other subcellular structures.

It has been estimated that between 28,000 and 120,000 genes are present in a human. The present consensus estimates between 30,000 and 70,000 genes. However, each gene does not necessarily correspond to one protein. Many genes are expressed in only one gender, at only one developmental stage and in response to certain different stimuli. Thus, the number of protein “gene products” present is considerably less.

However, a single gene may produce several different protein forms as the result of alternative splicing, cleaved signal sequences, posttranslational glycosylation, phosphorylation, cleavage, complexing with cofactors, metal ions, other proteins and other modifications. For example, the well-characterized protein insulin may be found as the C chain or the A chain linked to the B chain. If a separation or purification is performed under reducing conditions, the A and B chains will be separated. Thus, a single “gene product” may be visualized as up to three different “proteins” depending on the conditions.

Proteins are the working parts of living cells. All are parts of self-assembling machines, all can change in abundance in response to experimental and physiological variables, and all turn over constantly, but at different rates. Under starvation conditions the total cell mass may decrease without loss of any individual function of the resting state, and will regain but not exceed a predetermined mass when returned to conditions of normal nutrition, suggesting that the proteome, with its tens of thousands of proteins, is a highly coordinated system.

While collections of proteins are well known, they have not been previously integrated into a unified system able to acquire, organize and sort the data now required to understand both the molecular anatomy and the molecular physiology of man in terms of the human proteome. It is evident that such a system would make possible the detailed description of diseased states, contribute to understanding aging, redefine cancer, and allow both pharmacology and toxicology to be rewritten.

There is therefore an evident need for a cataloging of all of the known proteins that can serve both the passive anatomical function of a data repository and an active physiological function as a search engine for new data and discoveries. An essential attribute of an index is searchability. There is a need for a system, a means and organization to create an index that provides the means for searching the data contained therein for new information and relationships.

It is evident that although some of the data required for such an active index can be acquired from the scientific literature, only an integrated program, analogous to those in atomic physics and space research, can provide and manage the vast amounts of data that can and should be acquired.

A Human Protein Index was hypothesized in Anderson & Anderson, Journal of Automatic Chemistry 2(4):177-178 (1980) and Anderson & Anderson, Clinical Chemistry 28(4):739-748 (1982), and in conjunction with the human genome project in Anderson & Anderson, American Biotechnology Laboratory September/October 1985. However, heretofore, the materials and methods to allow for the development of such a resource of information were not available.

SUMMARY OF THE INVENTION

The instant invention relates to a method and means for systematically studying proteins to provide data thereon to enable making a catalog of proteins. The method of interest accounts for intertissue and interindividual variability. The method of interest enables the rapid provisional identification of proteins between and among samples. That provisional identification, which later can be confirmed, then can be relied on to develop further provisional identifications of other proteins in the same or other samples. The method reveals sample-specific markers, such as tissue-specific markers. The method provides a protein reference standard, be it for an individual protein, a set of proteins or a pattern of polypeptide spots appearing on a 2-D gel. That sort of reference standard can be applied across organelles, tissues, organs, individuals and so on. The catalog of proteins thus is useful for identifying and comparing similar and identical proteins from other sources, such as other tissues, other individuals of a population and species. The catalog and patterns will reveal relationships between and among proteins, for example, expression thereof under defined conditions, coregulation of proteins and so on. Therefore, proteins that are coordinately expressed or regulated will be revealed, as will proteins with a reciprocal or antagonistic pattern of expression wherein expression of one protein wanes or does not occur when another is expressed. The method yields a reference point for determining the reaction of an individual or a cell, and the proteins thereof, to a stimulus. The method provides a reference point to distinguish manifestations arising from an abnormal state, such as in a disease state. The catalog of proteins is useful for identifying sequences of nucleotides, or clones from a genomic or cDNA bank, that could or do encode a particular protein.
As to clones from a genomic bank, knowing the protein will enable determination of what processing of the genomic sequence occurs to obtain expression of the open reading frame. The protein index or database can be aligned, for example, with a chromosomal map or to a morbid gene map to reveal associations with a particular protein and with a particular disease, respectively. Identification of such markers will lend to the development of particular diagnostic and therapeutic materials and methods.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram showing various steps that form part of the analysis for comparing proteins of a plurality of different tissues, each tissue taken from a single species. 2D is two dimensional gel electrophoresis. MALDI is matrix assisted laser desorption/ionization, a form of mass spectrometry (MS). The dark gray arrows depict physical processes, the light gray arrows depict data comparing processes and the black arrows depict data handling processes.

FIG. 2 is a more detailed schematic block diagram showing various steps in the analysis depicted in FIG. 1, the steps depicted in FIG. 2 being directed to an analysis of one tissue sample at a time.

FIG. 3 is a pixel display of spots from a two dimensional gel (2DG) from 160 individuals of serum proteins with common serum proteins immunosubtracted. The x coordinate is a digitized measure of protein isoelectric focusing points and the y coordinate is a digitized measure of the molecular weights such that the graph resembles the conventional format for displaying two-dimensional gels.

FIG. 4 is the same display as FIG. 3 with co-regulating proteins being represented by circled spot areas and the corresponding near-perfect correlations indicating coregulated proteins connected by a line. At least some of the horizontal lines are believed to represent the same protein with a different glycosylated form resulting in a slight charge shift with minimal molecular weight change.

FIG. 5 is the same as the display of FIG. 4 showing very strong correlations.

FIG. 6 is the same as the display of FIG. 5 where all statistically significant correlations are depicted.
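
The correlation displays of FIGS. 4 to 6 can be illustrated with a minimal sketch. The text does not state which correlation statistic was used, so the Pearson coefficient below is an assumption, as are the spot names, the abundance values and the 0.95 cut-off, all of which are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(abundance, threshold):
    """Return (spot_a, spot_b, r) for every spot pair with |r| >= threshold."""
    spots = sorted(abundance)
    pairs = []
    for i, a in enumerate(spots):
        for b in spots[i + 1:]:
            r = pearson(abundance[a], abundance[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical spot abundances measured in five individuals (illustrative only).
abundance = {
    "spot_101": [1.0, 2.0, 3.0, 4.0, 5.0],
    "spot_102": [2.1, 4.0, 6.2, 8.1, 9.9],  # rises with spot_101
    "spot_203": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as spot_101 rises
    "spot_305": [3.0, 1.0, 4.0, 1.0, 5.0],  # no strong relationship
}

# Assumed cut-off for a "very strong" correlation; not a value from the text.
strong = correlated_pairs(abundance, threshold=0.95)
for a, b, r in strong:
    kind = "coordinate" if r > 0 else "reciprocal"
    print(f"{a} ~ {b}: r = {r:+.3f} ({kind})")
```

A negative coefficient in this sketch corresponds to a reciprocal pattern of expression, and a positive one to coordinate expression.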

DETAILED DESCRIPTION OF THE INVENTION

For the purposes of the instant application, a polypeptide or a peptide is a polymer of amino acid monomers of any length, that is, two or more amino acid residues, that is biologically relevant. A protein also is a polymer of amino acid monomers of any length, that is, two or more amino acid residues in length, and which is biologically relevant. Hence, for the purposes of the instant application, the words polypeptide, peptide and protein are used interchangeably. Another synonym is “spot” which in the context of the instant invention, relates to a polypeptide, peptide or protein displayed on a 2-D gel by a particular staining method.

Also for the purposes of the instant application, the assemblage of proteins and the characterizing properties, parameters and features thereof are organized into an index, a listing, a database, a dictionary, a catalog and so on. The result is an ordered set of elements, an element being, for example, a protein and the various distinguishing properties or parameters thereof. The identity of the protein need not be known. All of those terms describe a list of elements that are included into a single assemblage, wherein the elements are characterized by a plurality of features, wherein any one feature can serve as the basis for ordering the elements in the list. Possible features include total molecular weight, isoelectric point, tissue distribution, molecular weight(s) of specific fragments and so on. For the purposes of the instant application, all of the above terms, and any other used to describe the list of polypeptides or proteins of the instant invention, are used interchangeably.
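
By way of illustration only, an element of such an index and the ordering of the list by any one feature might be sketched as follows. The class name, field names and all values are hypothetical and are not taken from any actual database; the identity field is optional because the identity of the protein need not be known.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    """One element of the index: a spot and its distinguishing features."""
    spot_id: str
    molecular_weight: float          # total molecular weight, in daltons
    isoelectric_point: float         # pI
    tissues: tuple = ()              # tissue distribution
    fragment_masses: tuple = ()      # molecular weights of specific fragments
    identity: Optional[str] = None   # may remain unknown


def order_by(entries, feature):
    """Return the entries ordered on a single named feature."""
    return sorted(entries, key=lambda entry: getattr(entry, feature))


# Hypothetical entries, for illustration only.
catalog = [
    CatalogEntry("S-001", 66500.0, 5.9, ("serum", "liver")),
    CatalogEntry("S-002", 14300.0, 9.3, ("kidney",)),
    CatalogEntry("S-003", 42000.0, 5.3, ("muscle", "liver"),
                 identity="hypothetical actin-like spot"),
]

by_weight = [e.spot_id for e in order_by(catalog, "molecular_weight")]
by_pi = [e.spot_id for e in order_by(catalog, "isoelectric_point")]
print(by_weight)  # ordered on molecular weight
print(by_pi)      # ordered on isoelectric point
```

The same list is reordered without modification of its elements, which is the sense in which any one feature can serve as the basis for ordering.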

The protein index or catalog can be obtained for any species or could be an assemblage of proteins from plural species. Preferably, genetically identical individuals or clones are used to avoid normal variation and polymorphisms in a population. Thus, an inbred strain or a clone can be used. However, to obtain an index that is useful at the populational level or that can be used for any wild-type individual from a panmictic population, a number of individuals, inbred strains or clones from different parentals should be investigated to ascertain the level of populational variation.

However, genetically pure populations are not always available, particularly in sexually breeding plants and animals. The problem may be most pronounced in humans and wildlife. In those situations, it is necessary to sample several individuals of a population to determine the level of variation and to deduce an “average” for an individual protein that accounts for the normal variation found in the population.

At another level, it is beneficial to determine the intraindividual level of variation. A reasonable level of comparison would be to compare the proteins from the plural tissues of an individual. Such a comparison would identify those proteins that are similar, those that are identical and those that are specific to, between and among tissues. By monitoring proteins from various tissues, it will be possible to ascertain those proteins that are not altogether identical in physical characteristics but carry out the same function.

The term “tissue” is broad and may include different developmental stages of an organ or structure. Particularly in embryos, organ precursor tissue may not have the same function and may comprise numerous different proteins. Some embryo proteins are never seen again in the adult organism other than perhaps in cancerous tissue. Thus, different developmental stages of the same structure are considered different “tissues”.

A preferred approach to control for populational variation of a protein is to sample various tissues of a single individual. That exercise provides information on the normal variation of a protein in an individual, for example, due to post-translational variation, such as variable glycosylation, as well as limited expression in one or more tissues. Thus, at least one tissue is studied from an individual, but preferably, more than one tissue is examined. Therefore, at least 5; at least 6; at least 7; at least 8; at least 9; at least 10; at least 11; at least 12; at least 13; at least 14; at least 15; at least 16; at least 17; at least 18; at least 19; or at least 20 tissues can be studied. More than 20 tissues can be examined, such as 30, 40, 50, 60, 70, 80 or more tissues, and at some point in time, all tissues of an individual will be studied to ascertain the various classes of proteins, such as the intertissue distribution of a protein, tissue-specific proteins and the like.

Sub-tissue distribution, such as in particular cells, organelles, fractions and so on also can be examined. The tissue is treated to release the individual component cell or cells; the cells are treated to release the individual component organelles and so on. Those partitioned samples then can serve as the protein source for discrimination in 2-D gels and any further methodologies associated therewith.

In the case of a tissue, a tissue sample is obtained and prepared for separation of the proteins therein using a method that provides suitable levels of discrimination of the proteins comprising a cell. The proteins can be obtained by any of a variety of known means, such as enzymatic and other chemical treatment, freeze drying the tissues, with or without a solubilizing solution, repeated freeze/thaw treatments, mechanical treatments, combining a mechanical and chemical treatment and using frozen tissue samples and so on.

To provide a more particularized origin of protein, specific kinds of cells can be purified from a tissue using known materials and methods. To provide proteins specific for an organelle, the organelles can be partitioned, for example, by selective digestion of unwanted organelles, density gradient centrifugation or other forms of separation and then the organelles are treated to release the proteins therein and thereof. The cells or subcellular components are lysed as described hereinabove. Other specific techniques for isolating single cells or specific cells are known such as Emmert-Buck et al., “Laser Capture Microdissection” Science 274(5289):998-1001 (1996).

Sensitive methods for cell separation may involve the use of cell type-specific antibodies attached to magnetic beads. Such beads have been used to isolate cholangiocytes for high-resolution protein analysis. (Cholangiocyte-specific rat liver proteins identified by establishment of a two-dimensional gel protein database. Tietz et al., Electrophoresis 19:3207-3212, 1998). Systematic development of magnetic bead cell separation requires the isolation of cell type-specific proteins from the cell membranes of as many human cells as possible. Thus, knowledge of the tissue, cell or fraction specific proteins is important to cell fractionation systems.

Complete, perfect separation of subcellular particles and of different cell types is difficult and varying levels of contamination frequently will be seen. In addition, instances can occur where two or more cell types are very difficult to separate without much further development. In such instances, methods for the decomposition of mixtures based on the analysis of mixtures containing different ratios of two cells may be used. The principles of mixture decomposition applied to the analysis of two-dimensional electrophoretic separation of protein samples have been mentioned in Taylor & Giometti, Appl. Theor. Electrophoresis 1:47-51, 1988. Such methods can be applied to subcellular fraction analysis or to the deconvolution of mixtures of three or more cell types in the instant invention.
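
The idea of decomposing mixtures that contain different ratios of two cell types can be sketched for a single spot as follows. The sketch assumes simple linear mixing, in which the observed spot intensity is the fraction-weighted sum of the intensities in the two pure cell types; that assumption, the function name and the numbers are illustrative only and are not the procedure of the cited reference.

```python
def decompose_two_cell_mixture(frac_a_1, intensity_1, frac_a_2, intensity_2):
    """Estimate a spot's intensity in pure cell type A and pure cell type B.

    Assumes linear mixing for each of two mixtures:
        intensity = frac_a * a + (1 - frac_a) * b
    where frac_a is the known fraction of cell type A in the mixture.
    The two equations are solved for a and b by Cramer's rule.
    """
    det = frac_a_1 * (1 - frac_a_2) - frac_a_2 * (1 - frac_a_1)
    if abs(det) < 1e-12:
        raise ValueError("the two mixtures must contain different ratios")
    a = (intensity_1 * (1 - frac_a_2) - intensity_2 * (1 - frac_a_1)) / det
    b = (frac_a_1 * intensity_2 - frac_a_2 * intensity_1) / det
    return a, b


# Hypothetical measurements: mixture 1 is 80% cell type A, mixture 2 is 30%.
pure_a, pure_b = decompose_two_cell_mixture(0.8, 84.0, 0.3, 44.0)
print(round(pure_a, 6), round(pure_b, 6))
```

With more than two mixtures, or more than two cell types, the same linear model would be fitted by least squares rather than solved exactly.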

Subcellular fractionation using density gradients and zonal centrifuges has been described (Anderson, “The Development of Zonal Centrifuges and Ancillary Systems for Tissue Fractionation and Analysis” National Cancer Institute Monograph 21, 1966). A variety of methods has been developed aimed at the isolation of one or more subcellular fractions. However, multiple parallel methods wherein a series of similar samples, for example, liver samples from different individuals, are fractionated in parallel wherein all of the initial sample is recovered and which are therefore quantitative, have not been described previously nor has any need existed for such methods to be developed. In the instant invention, reproducible density gradients and attending materials and methods for 2-D gel electrophoresis are formed by the materials and methods of related patent applications, Ser. No. 551,314 filed 18 Apr. 2000; Ser. No. 628,340 filed 28 Jul. 2000; Ser. No. 573,539 filed 19 May 2000; and Ser. No. 643,675 filed 24 Aug. 2000; as well as attorney docket number 40148 filed 21 Jul. 2000 relating to automated SDS electrophoresis, the contents of which are incorporated by reference. Those techniques allow minor proteins concentrated in one or a few subcellular fractions to be identified and quantitated. Thus, the dynamic range of the two dimensional gel electrophoresis (2DE) analysis or other analysis is greatly increased to the level where a comprehensive protein database now can be generated.

In 2DE maps of whole tissues, a few proteins are observed which are restricted to one subcellular fraction. For example, the mitochondrial proteins, HSP 60 and COX-II, and the nuclear proteins, PCNA and LAM-B, are seen on 2D gels, while dozens of minor proteins in those organelles are not. The minor proteins are seen, however, when isolated mitochondria or nuclei are analyzed separately. An alternative method for increasing the dynamic range while preserving quantitation is to use one or a few proteins for quantitative referencing. The amount of lamin B, for example, can be determined in a gel pattern from a whole tissue, and in a gel pattern obtained using highly purified nuclei. In the first pattern, lamin B will be a minor spot, in the latter, a major spot. The ratio of spot intensity for a protein of isolated nuclei may be referenced to lamin B. The ratio between the lamin B intensity on whole tissue gels and on the gels from isolated nuclei can be used as a multiplier to calculate the quantity of minor proteins in the whole tissue sample. That spot intensity referencing technique can be applied to any other organelle or source wherein minor proteins are to be identified.
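
The arithmetic of spot intensity referencing can be sketched as follows. The function name and the intensity values are hypothetical, and the sketch assumes that spot intensities are proportional to protein quantity on both gels.

```python
def whole_tissue_estimate(minor_in_fraction, reference_in_fraction,
                          reference_in_whole):
    """Estimate the whole-tissue quantity of a minor protein.

    The multiplier is the ratio of the reference protein's intensity on the
    whole tissue gel to its intensity on the gel from the isolated fraction
    (for example, lamin B with purified nuclei).
    """
    if reference_in_fraction <= 0:
        raise ValueError("reference spot must be detectable in the fraction")
    multiplier = reference_in_whole / reference_in_fraction
    return minor_in_fraction * multiplier


# Hypothetical intensities in arbitrary units:
# lamin B is 2000 on the nuclear gel and 50 on the whole tissue gel;
# a minor nuclear protein is 120 on the nuclear gel.
estimate = whole_tissue_estimate(120.0, 2000.0, 50.0)
print(estimate)  # 120 * (50 / 2000) = 3.0
```

In this example the minor protein would correspond to an intensity of about 3 units in the whole tissue pattern, which may be too faint to measure there directly.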

The lysate can be treated to remove non-proteinaceous matter by particular treatments, such as digestion with a nuclease or a lipase. The unwanted molecules then can be removed by, for example, physical means, such as centrifugation, precipitation and so on.

The crude protein preparation can be treated further to enhance the purity of the proteins. The crude protein preparation also can be exposed to a treatment that partitions the proteins based on a common property, such as size, subcellular location and so on.

For example, the crude lysate can be partitioned prior to high-resolution separation of the proteins to reduce the number of proteins for ultimate separation and to enhance discrimination. Thus, the crude lysate can be fractionated by chromatography. Such a preliminary treatment is particularly useful when a sample is known to contain one or more abundant proteins, such as albumin in serum. Removing abundant proteins may enhance the relative abundance of minor species of proteins that can be loaded on a 2-DG. Plural preliminary fractionation steps can be practiced, such as using multiple chromatography steps, with the chromatography steps being the same or different, or multiple extraction or other partitioning steps. Suitable chromatography methods include those known in the art, such as immunoaffinity, size exclusion, lectin affinity and so on.

In the experiments yielding the serum protein data given in some of the figures, the five abundant serum proteins, albumin, transferrin, haptoglobin, alpha-1-antitrypsin and IgG, were removed by passing the sample through a column having an immobilized antibody to each of those proteins. The process removed over 80% of the proteins and allowed higher gel loading of less common proteins. Additional data have been generated using 11 antibodies to the common serum proteins thereby removing 93% of the more abundant proteins. That immunosubtracting method thus relies on the concurrent use in a single step of multiple, immobilized antibodies to the more common proteins.

The proteins then are separated by a method that provides discrimination and resolution. For example, the proteins can be separated by known methods, such as chromatography, immunoelectrophoresis, mass spectrometry or electrophoresis. The proteins can be separated in a liquid phase in combination with a solid phase. For example, a suitable separation method is two-dimensional (2-D) gel electrophoresis.

An overall scheme employing 2-D gel electrophoresis for the initial separation of proteins is provided in FIGS. 1 and 2.

The blocks in FIG. 1 indicate the following steps:

Scan 2D Gel A (B) of Tissue A (B): represents the steps of operating a camera or scanner to scan a two dimensional electrophoresis gel produced in the steps set forth in FIG. 2, the scanned image then being inputted into a computer for computer analysis;

Locate Spots via Image Processing: represents the steps of performing a computer analysis of the spots that appear in the scanned image of the 2D gel to identify location and size of each spot in the 2D gel and thereafter select specific spots to be excised for further study by, for instance, mass spectrometry;

Cut Spots for MS (Mass Spectrometry) Identification: represents the step of excising spots from the 2D gel that have been identified as being designated for further study;

Digest Spots to Peptides: represents well known procedures for processing excised spots in preparation for mass spectrometry analysis;

Prepare MALDI TARGETS: represents spotting or depositing the digested spots from the 2D gel on a MALDI mass spectrometry sample plate;

MALDI MS Analysis: represents the performance of a mass spectrometry analysis on each digested spot on the sample plate using a MALDI mass spectrometry apparatus (a matrix-assisted laser desorption/ionization apparatus) where the biological sample is embedded in a volatile matrix and is vaporized by being subjected to an intense laser emission, one such MALDI apparatus being a MALDI-TOF apparatus (TOF is time-of-flight spectrometry), the results of the analysis being the masses of the peptides of the tested processed spot samples;

Archive Raw Peptide Masses: represents storage in either or both computer format and paper archive format of the results of the MALDI mass spectrometry analysis;

Spot # Peptide #: represents the step of comparing the various determined masses (molecular weight MW) of the peptides analyzed using the mass spectrometry apparatus, the peptides of tissue A being compared to the peptides of tissue B;

Generate Similarity Scores For All Gel A Spot Peptide Masses vs. All Gel B Spot Peptide Masses: represents the step of generating and storing the results of the comparison between the peptide masses of the spots of the 2D gel of tissue A and the peptide masses of the spots of the 2D gel of tissue B;

Select Similarities Above Threshold Likely To Indicate Protein Identity: represents the steps of selecting those generated similarities in peptide masses (MW) that clearly indicate a correspondence between spots in the 2D gel of tissue A and the 2D gel of tissue B;

Retain Putative Matches Where Gel A Spot and Gel B Spot Have Similar pI, MW: represents the storage of the selected similarities between gel A and gel B, wherein pI represents the isoelectric focusing point of each protein separated during electrophoresis;

Gel A Spot 1—Gel B Spot 25: represents a list of the retained putative matches between spots in gel A and spots in gel B;

Warp Gel A onto Gel B Using MS Matches as Landmark Matches: represents a computer implemented process whereby the spots in the scanned computer image of gel A are warped into alignment (registration) with the spots in the scanned computer image of gel B (Warping refers to a process of applying geometric corrections to modify the shape of features and to change their spatial relationships. Warp is a statistical treatment of the multiple elements of plural arrays to yield a best fit of the arrays. Another term used for a warping process is rubber-sheeting because the warping process can be likened to stretching a rubber sheet wherein portions of one or more images are stretched or shrunk in order to bring the spots on all the images into registration with one another and still maintain relative positional relationships between the spots.);

Match Additional Spots Based Upon Positional Similarity After Warping: represents the steps of matching additional spots based on similar relative locations of the spots in gel B and the locations of the spots in warped gel A;

Verify Additional Matches Using MS Data: Marginal Similarity: represents the steps of performing additional mass spectrometry (MS) analysis of several spots that are in marginally similar locations in the gel B and warped gel A in order to verify that the various spots are indeed the same peptides in each of the two gels; and

Homologous Spots Identified, Unmatched Spots Classed as Unique: represents the steps of concluding that all landmark matches, all matched spots, all aligned spots and all verified matched spots are indeed the same spots common to both gel A and gel B thereby providing a relationship between a plurality of the peptides (proteins) in tissue A and tissue B, and further classifying all unmatched spots in gels A and B as being unique to respective tissue A or tissue B.
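
The scoring, thresholding and pI/MW filtering steps listed above can be sketched as follows. The text does not define the similarity score, so the score used here (shared peptide masses divided by the size of the smaller mass list) is an assumption, as are the mass tolerance, the thresholds, the spot labels and all data values.

```python
def shared_mass_count(masses_a, masses_b, tol):
    """Count one-to-one pairings of masses that agree within tol.

    Uses a greedy two-pointer walk over the sorted lists, which is adequate
    when the tolerance is small relative to the spacing of the masses.
    """
    a, b = sorted(masses_a), sorted(masses_b)
    i = j = count = 0
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tol:
            count += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return count


def similarity(masses_a, masses_b, tol=0.2):
    """Assumed score: shared masses over the length of the shorter list."""
    if not masses_a or not masses_b:
        return 0.0
    shared = shared_mass_count(masses_a, masses_b, tol)
    return shared / min(len(masses_a), len(masses_b))


def putative_matches(gel_a, gel_b, score_min=0.6, max_dpi=0.5,
                     max_rel_dmw=0.15):
    """Keep spot pairs scoring above threshold that also have similar pI, MW."""
    matches = []
    for id_a, a in gel_a.items():
        for id_b, b in gel_b.items():
            score = similarity(a["masses"], b["masses"])
            if score < score_min:
                continue
            similar_pi = abs(a["pI"] - b["pI"]) <= max_dpi
            similar_mw = (abs(a["mw"] - b["mw"])
                          <= max_rel_dmw * max(a["mw"], b["mw"]))
            if similar_pi and similar_mw:
                matches.append((id_a, id_b, score))
    return matches


# Hypothetical spots: peptide masses (Da), gel pI and gel MW (Da).
gel_a = {
    "A1": {"masses": [842.5, 1045.6, 1296.7, 1570.7, 2211.1],
           "pI": 5.6, "mw": 42000},
    "A2": {"masses": [900.4, 1100.5, 1500.8], "pI": 7.2, "mw": 25000},
}
gel_b = {
    "B25": {"masses": [842.6, 1045.5, 1296.7, 1570.8, 1800.9],
            "pI": 5.7, "mw": 41500},
    "B30": {"masses": [950.1, 1210.3, 1620.9], "pI": 6.9, "mw": 26000},
    # Same masses as A1 but a very different gel position: rejected by pI, MW.
    "B31": {"masses": [842.5, 1045.6, 1296.7, 1570.7],
            "pI": 8.9, "mw": 12000},
}

matches = putative_matches(gel_a, gel_b)
print(matches)
```

In this sketch the retained pairs would serve as the landmark matches for the subsequent warping step; the warping itself is not shown.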

The blocks in FIG. 2 represent the following steps:

Sample Generation: represents known methods of preparing a sample from a biological tissue for subsequent electrophoresis;

1^(st) Dimension Gel Production: represents known methods of preparing a gel for use in a first dimension of electrophoresis;

Load Sample on 1^(st) D Gel: represents the step of depositing the prepared sample into the first dimension electrophoresis gel;

Run 1^(st) D Gel: represents subjecting the first dimension electrophoresis gel to predetermined amounts of electric current to separate the prepared sample linearly along the length of the 1^(st) D gel;

2^(nd) Dimension Gel Production: represents the steps of preparing a second dimension electrophoresis gel;

Load 1^(st) D Gel On 2^(nd) D Gel: represents the step of taking the 1^(st) D gel with the separated sample and depositing the 1^(st) dimension gel on one edge of the 2^(nd) D gel;

Run 2^(nd) D Gel: represents the step of subjecting the 2^(nd) D gel to a predetermined amount of electric current to further separate the proteins from the 1^(st) D gel into a planar two dimensional array of separated proteins;

Fix 2^(nd) D Gel: represents the steps of removing the 2^(nd) D gel from retaining glass plates that supported the 2^(nd) D gel during the current applying process (the electrophoresis) and thereafter treating the gel with a fixing solution in preparation for further processing;

CB Stain 2^(nd) D Gel: represents various steps necessary for staining the spots on the 2^(nd) D gel using Coomassie blue dye (CB) thereby making the spots visible;

CB Scan 2^(nd) D Gel: represents the scanning process mentioned with respect to FIG. 1, whereby the 2^(nd) D gel is scanned by a scanner or a camera to generate a computer processable image of the gel;

Destain 2^(nd) D Gel: represents the process of removing stain from the gel;

Silver Stain 2^(nd) D Gel: represents the step of restaining the gel using a silver stain;

SS Scan 2^(nd) D Gel: represents the step of scanning the silver stained 2^(nd) D gel using a camera or scanner, where optionally multiple time-lapse scans of a single gel may be taken during the staining process;

Silver Image Assembly: represents the process of combining multiple images of a single gel to obtain more refined information as set forth in co-pending U.S. Ser. No. 09/387,728 filed 1 Sep. 1999 entitled “Gel Electrophoresis Image Combining . . . ” incorporated herein by reference in its entirety;

Kepler De Novo Processing: represents the step of subjecting the silver stain image of the gel to processing using the KEPLER™ software or other similar spot analyzing software (KEPLER™ is the trade name of a data collection, collation and storage means beginning with image analysis of stained gels and including transformation of that data into a digitized form);

Initial Matching: represents the step of manually (visually) identifying various spots in the gel image;

Impress Fitting: represents a computer implemented process whereby spots in the scanned gel image are processed in conjunction with manipulation of a tissue-specific master pattern, the master pattern defining relative locations of various spots and having master spot numbers that identify previously considered spots, the process being performed to identify various spots in the scanned 2^(nd) D gel and to assign master spot numbers to at least some of those identified spots, the Impress process being disclosed in co-pending U.S. patent application entitled “Method and Apparatus for Impressing a Master Pattern to a Gel Image” filed 3 Aug. 2000 having attorney docket number 40732, incorporated herein by reference in its entirety;

Kepler Database (MAP & MED): represents the step of updating the Kepler database, including the sections of the database MAP (Molecular Anatomy and Pathology) and MED (Molecular Effects of Drugs);

Cut Spots for MS Identification: represents the steps of locating and excising various spots that are to be subsequently analyzed using a mass spectrometer, one spot cutting (excising) apparatus being disclosed in U.S. Pat. No. 5,993,627 incorporated herein by reference in its entirety;

Digest Spots: represents the step mentioned above with respect to FIG. 1 where spots excised from the 2^(nd) D gel are processed in preparation for MS analysis;

Prepare MALDI Targets: represents the step mentioned above with respect to FIG. 1 where digested spots are deposited on a sample plate of a MALDI mass spectrometry apparatus;

MALDI MS Analysis: represents the step of analyzing spots using a MALDI mass spectrometry apparatus as mentioned above with respect to FIG. 1;

Archive Raw Peptide Masses: represents the step mentioned above with respect to FIG. 1, wherein the masses (molecular weights) of the peptides subjected to MS analysis are stored;

Profound & Protein Prospectr: represents the steps of comparing the analysis results using two commercially available software programs, PROFOUND marketed by Proteometrics, Inc. and PROTEIN PROSPECTR marketed by Applied Biosystems, Inc.;

Review Ids: represents a review of the various spot identifications described above;

MS Spot Identification Database: represents the updating of a database having compiled mass spectrometry data therein;

Spot Similarity w/o Identification: represents the step of adding various hypothetical identifications of spots to the MS Spot Identification Database concerning various spots that were not subjected to MS analysis but where the hypothetically identified spots did fall into alignment with spots from a different tissue sample 2^(nd) D gel;

LC/MS/MS Analysis: represents various additional analysis steps, including liquid chromatography processes (LC) and tandem mass spectrometry processes (MS/MS);

Archive Raw MS Scans: represents the step of storing for future consideration the results of all mass spectrometry tests; and

Sequest & Mascot Interp: represents the steps of interpreting the analysis results using commercially available software programs with SEQUEST being commercially available from Finnigan and MASCOT from Micromass.

Methods for cell separations from tissues for a limited number of cell types are known, as are means for subcellular fractionation of certain components, many of which are specific to one tissue or cell type. Separation reagents and methods were not previously available that are applicable to the separation of every human cell type. No multiple-parallel high-resolution methods for subcellular fractionation of many samples of different cells or tissues have been previously described nor was any such separation methodology ever needed or desired previously.

Means for the partial global separation of cell proteins using high resolution two-dimensional electrophoresis are known, as are methods and systems for characterizing, sequencing and identifying the separated proteins by mass spectrometric methods. However, those techniques, from cell separation through to protein identification, have not been integrated into one automated system capable of high throughput. Organ-specific and cell-specific proteins also are well known, but no complete index of such has been attempted.

In general, 2-D gel electrophoresis separates proteins by charge and molecular weight (MW). The two parameters on which 2-D separation is based, namely isoelectric point and mass, are almost completely unrelated. Thus, the theoretical resolution of the 2-D system is the product of the resolutions of each of the constituent methods, which is in the range of 150 molecular species for each of isoelectric focusing (IEF) and of sodium dodecyl sulfate (SDS) gel electrophoresis. Hence, the theoretical resolution for the complete system is about 22,500 proteins. In practice, as many as 5,000 proteins have been resolved experimentally. Resolution can be enhanced by the selective use of sample, reproducible and standardized methods and sensitive detection means, for example.
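The arithmetic behind the resolution figures can be illustrated with a brief sketch. The numbers are those stated above; the variable names are merely illustrative and form no part of the described system.

```python
# Figures from the text; the names are illustrative assumptions.
ief_resolution = 150   # species resolvable by isoelectric focusing
sds_resolution = 150   # species resolvable by SDS gel electrophoresis

# Because the two separation parameters are nearly unrelated,
# the resolutions of the two dimensions multiply.
theoretical_capacity = ief_resolution * sds_resolution

observed_proteins = 5000  # resolved experimentally, per the text
fraction_realized = observed_proteins / theoretical_capacity

print(theoretical_capacity)
print(round(fraction_realized, 3))
```

On those figures, the experimentally resolved count is somewhat under one quarter of the theoretical capacity.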

The solid phase gels for 2-D electrophoresis generally are made of a porous polymer, such as polyacrylamide, and are constructed using known methods. To minimize interassay and intraassay variability, it is beneficial if the materials and methods for making the gels are reproducible and, perhaps, if the gels are produced by an automated means to reduce introduced variability. Gel monomers are mixed with agents that induce polymerization and then are poured into a mold that dictates the size and shape of the polymerized gel. For example, the catalyzed liquid gel monomer can be poured between glass plates separated uniformly over the entire surfaces thereof to produce a square or rectangular slab gel. The glass plates can be separated by about a millimeter or a fraction thereof. Thinner gels generally enhance resolution.

Protein samples to be analyzed using 2-D electrophoresis typically are solubilized in an aqueous, denaturing solution such as one containing a chaotropic agent, such as, urea, at a concentration of about 9 M; a detergent, and perhaps a non-ionic detergent, such as, NP-40, at a concentration of about 2%; a commercially available set of ampholytes, often purchased as a mixture, for example of a defined pH range of 8 to 10; and a reducing agent, such as, dithiothreitol (DTT), at a concentration of about 1%. The solubilization step may be separated into different stages each with different solubilizing solutions to prepare different fractions to further distinguish the proteins.

The chaotropic agent and detergent dissociate complexes of proteins with other proteins and with DNA, RNA etc. A suitable ampholyte mixture is one that serves to establish a high pH (~9) outside the range where most proteolytic enzymes are active, thereby preventing modification of the sample proteins by such enzymes in the sample. The high pH ampholytes complex with DNA present in the sample. By complexing the DNA, the ampholytes allow DNA-binding proteins to be released while preventing the DNA from swelling into a viscous gel that interferes with separation. The reducing agent minimizes the presence of disulfide bonds in the sample proteins, thus allowing the proteins to be unfolded and to assume an open structure optimal for separation.

Samples of tissues, for example, are solubilized by rapid homogenization in various denaturing, solubilizing solution(s), after which the sample is centrifuged to pellet insoluble material and DNA. The supernatant is collected and is amenable to the separation procedure.

To ensure that proteins retain constant chemical properties during separation, it is desirable that the sulfhydryl (SH) groups of the cysteine residues do not reform disulfide bridges or become oxidized to cysteic acid. Therefore, cysteine residues can be rendered stable by various modifications of the sulfhydryl groups, for example, by alkylation with a zwitterionic derivative of iodoacetamide (2-amino-5-iodoacetamido-pentanoic acid). That reaction introduces a very hydrophilic group on the cysteine residues but does not change the net charge or apparent isoelectric point of the polypeptide.

Such a derivatization can be implemented, for example, using a size exclusion gel filtration column to exchange the proteins out of the initial sample solubilization solution, through a reagent zone containing, for example, an alkylating reagent, and finally into a medium suitable for application to an IEF gel. The size exclusion medium can be chosen to exclude proteins but not low molecular weight solvents (e.g., polyacrylamide beads such as BioRad P-6 BioGel).

Of the 20 amino acids found in typical proteins, four (aspartic and glutamic acids, cysteine and tyrosine) carry a negative charge and three carry a positive charge (lysine, arginine and histidine) in some pH range. A specific protein, defined by the specific sequence of amino acids thereof, thus is likely to incorporate a number of charged groups therein. The magnitude of the charge contributed by each amino acid is governed by the prevailing pH of the surrounding solution and can vary from a minimum of 0 to a maximum of 1 charge (positive or negative depending on the amino acid) as revealed in a titration curve relating charge and pH according to the pK of the amino acid in question. The total charge of the protein molecule is, under denaturing conditions, approximately the sum of the charges of the component amino acids, all at the prevailing solution pH.

Two proteins having different ratios of charged, or titrating, amino acids can be separated by virtue of different net charges at some pH. Under the influence of an applied electric field, a more highly charged protein will move faster through a medium than a less highly charged protein of similar size and shape. If the proteins thus are made to move from a sample zone through a non-convecting medium, such as, a polyacrylamide gel, an electrophoretic separation will result. If, in the course of migrating under an applied electric field, a protein enters a region whose pH has that value at which the net charge of the protein is zero, that is, the isoelectric pH or isoelectric point, the protein will cease to migrate relative to the medium. Further, if the migration occurs through a monotonic pH gradient, the protein will ‘focus’ at the particular pH value where movement is minimal.

If the protein moves toward more acidic pH values, the protein will become more positively charged and a properly oriented electric field will propel the protein back towards the isoelectric point. Likewise, if the protein moves towards more basic pH values, it will become more negatively charged and the same field will drive the protein back toward the isoelectric point.
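The titration behavior and the resulting isoelectric point can be sketched numerically as follows. The sketch sums the fractional charges of the titrating groups at a given pH and locates the pH of zero net charge by bisection. The pK values are approximate textbook figures and are assumptions, as is the inclusion of the chain termini, which the description above does not discuss; published pK tables differ, so computed values are indicative only.

```python
# Approximate pK values (assumptions; published tables differ).
PK_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PK_N_TERMINUS = 9.0  # assumed; termini are not discussed in the text
PK_C_TERMINUS = 3.0  # assumed


def net_charge(sequence, ph):
    """Sum of fractional charges (each between 0 and 1 in magnitude)."""
    charge = 1.0 / (1.0 + 10 ** (ph - PK_N_TERMINUS))
    charge -= 1.0 / (1.0 + 10 ** (PK_C_TERMINUS - ph))
    for residue in sequence:
        if residue in PK_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_POSITIVE[residue]))
        elif residue in PK_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PK_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisection works because net charge falls steadily as pH rises."""
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle   # still positive: the pI lies at higher pH
        else:
            high = middle  # negative or zero: the pI lies at lower pH
    return (low + high) / 2.0


peptide = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical test sequence
pi = isoelectric_point(peptide)
print(round(pi, 2), round(net_charge(peptide, pi - 1.0), 2),
      round(net_charge(peptide, pi + 1.0), 2))
```

The printed charges on either side of the computed pI have opposite signs, which is the restoring behavior described above: a protein displaced toward acidic pH becomes positive and one displaced toward basic pH becomes negative.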

The isoelectric focusing separation process can resolve two proteins differing by less than a single charged amino acid among hundreds in the respective primary amino acid sequences.

Formation of an appropriate spatial pH gradient is a requirement of the focusing procedure. That can be achieved either dynamically, by including a heterogeneous mixture of charged molecules (ampholytes) in the initially homogeneous separation medium, or statically, by incorporating a spatial gradient of titrating groups into the matrix through which the migration will occur. The former represents classical ampholyte-based isoelectric focusing, and the latter, the more recently developed immobilized pH gradient (IPG) isoelectric focusing technique.

The IPG approach has the advantage that the pH gradient is fixed in the gel, while the ampholyte-based approach is susceptible to positional drift as the ampholyte molecules move in the applied electric field. In practice, the two approaches can be combined to provide a system where the pH gradient is spatially fixed, but small amounts of ampholytes are present to decrease the adsorption of proteins onto the charged matrix containing the IPG.

IPG gels can be created in a thin planar configuration bonded to an inert substrate, such as, a sheet of Mylar plastic that has been treated so as to bond chemically to an acrylamide gel (e.g., Gelbond® PAG film, FMC Corporation). The IPG gel typically is formed as a rectangular plate about 0.5 mm thick, 0 to 30 cm long (in the direction of separation) and about 10 cm wide.

Multiple samples can be applied to such a gel in parallel lanes. However, the ability to separate plural samples must be balanced with the attending problem of diffusion of proteins between lanes.

When one or more of the separated proteins in a given lane are to be recovered from that lane following focusing, as is typically the case in 2-D electrophoresis, it may prove beneficial to split the gel into narrow strips, such as, about 3 mm wide strips, each of which can be run as a separate gel. Since the proteins of a sample then are confined to the volume of the gel represented by the single strip, quantitative recovery of the separated proteins in that strip can be obtained. Such strips are produced commercially, for example, by Pharmacia (Immobiline DryStrips).

While the narrow strip format solves the problem of containing samples within a recoverable, non-cross-contaminating region, there remain other considerations associated with the introduction of sample proteins into the gel. Since protein-containing samples typically are prepared in a liquid form, the proteins must migrate, under the influence of the electric field, from a liquid-holding region into the IPG gel to undergo separation. Thus, for example, the IPG strip can be reswollen, from the dry state, in a solution containing sample proteins, with the intention that the sample proteins completely permeate the gel at the start of the run.

Suitable compositions of the components combined to make a focusing gel are known in the art. Solutions of polymerization catalyst and initiator (assuming that each comprises about 10% of the total volume dispensed) can be, respectively, about 1.2% tetramethylethylene diamine (TEMED) and about 1.2% ammonium persulfate (AP), both in water. The two solutions of polymerizable monomers (whose proportions in the output stream vary to yield a gradient of titratable monomers and physical density) may be made to achieve a gradient over the pH range of about pH 4 to 9. The titratable monomers used can be, for example, Immobilines® manufactured by Pharmacia Biotech. Glycerol and deuterium oxide (heavy water) can be used to increase the density of one of the solutions, thereby helping to stabilize the gradient formed in the mold through the interaction of the resulting density gradient and ambient gravity.

After sample loading, the gel strip is exposed to a device to effect focusing; for example, the gel strip is moved to one of a plurality of slots filled with, for example, a non-conducting oil, such as silicone oil, and having slotted carbon electrodes at both ends positioned so as to contact the ends of the gel. The oil may be circulated, cooled to ensure constant running temperature and sparged with a dry gas to eliminate oxygen and dissolved water. Since the resistance of the gel rises during the run, slots maintained at a series of different voltages are provided, and the strip is moved from one voltage to a higher voltage as the run progresses. For example, a series of voltage stages of 1, 2.5, 5, 10, 20 and 40 kilovolts can be provided. The gel can be maintained at each voltage for about 3 hours, except at the last voltage, where the gel can rest until a second dimension slab gel is available. A total of 200,000 to 300,000 volt-hours may be applied to each gel.
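The relationship between the staged voltages and the stated total of volt-hours can be worked through in a short sketch. The voltages and the three hour dwell are taken from the text; the function name and the assumption that the dwell is exactly three hours at every stage but the last are illustrative only.

```python
# Stage voltages and dwell time from the text; names are illustrative.
voltages_kv = [1, 2.5, 5, 10, 20, 40]
hours_per_stage = 3  # "about 3 hours" is treated as exactly 3 here

# Volt-hours accumulated over every stage except the final one.
ramp_volt_hours = sum(kv * 1000 * hours_per_stage for kv in voltages_kv[:-1])


def hours_at_final_voltage(target_volt_hours):
    """Time needed at the last stage to reach a target total."""
    remaining = max(target_volt_hours - ramp_volt_hours, 0)
    return remaining / (voltages_kv[-1] * 1000)


print(ramp_volt_hours)
print(hours_at_final_voltage(200_000), hours_at_final_voltage(300_000))
```

Under those assumptions the first five stages contribute 115,500 volt-hours, so the stated total of 200,000 to 300,000 volt-hours corresponds to roughly two to five hours at the 40 kilovolt stage.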

During the early stages of a separation run, under an applied electric field, proteins can migrate through the liquid phase of the applied sample along a pH gradient initially formed by the action of the ampholytes incorporated in the sample. Because the proteins initially are migrating through liquid, without the retardation associated with migration through a gel matrix, the proteins can approach individual isoelectric points more rapidly than in a system where the entire migration path is through a gel.

As the run progresses, the sample-containing liquid is imbibed by the gel, progressively shrinking the channel so that at the end of the run, the channel contains a negligible amount of liquid. That can be achieved by allowing surface water to be removed slowly from the exterior surface of the gel during the run, for example, by immersion of the gel in circulated silicone oil that has been dehydrated by sparging with a dry gas such as argon or nitrogen.

During gel dehydration, proteins enter the gel at positions near the respective isoelectric points of the proteins. Thus a mixture of different proteins will enter the gel at points distributed along the gel length, rather than at one site at the edge of a sample well, thereby avoiding the precipitation often observed when a complex mixture of proteins migrates into a gel together through a small gel surface area. Excess liquid is removed through the exterior gel surface, either to a dry gas phase or to a water-extracting non-aqueous non-conducting liquid phase such as silicone oil.

Isoelectric focusing and various aspects of gel electrophoresis separation techniques are described, for example, in U.S. Pat. Nos. 4,130,470; 4,196,036; 4,594,064; 5,074,981; 5,164,065; 5,275,710; and 5,304,292.

In a 2-D procedure, once the proteins are separated according to isoelectric point, the proteins generally then are separated by size.

The proteins can be native and untreated or treated with a detergent or other reagent that causes the proteins to assume a uniform shape so that the separation is based solely on size. For example, the proteins can be denatured by treatment with a detergent, such as, sodium dodecyl sulfate (SDS).

Charged detergents such as SDS bind strongly to protein molecules and unfold the proteins into semi-rigid rods where the length thereof is proportional to the length of the polypeptide chain and hence approximately proportional to molecular weight. A protein complexed with such a detergent also is highly charged (because of the charges of the bound detergent molecules) and that charge causes the complex to move in the applied electric field.

Furthermore, the total charge is approximately proportional to molecular weight since the charge of the detergent vastly exceeds the intrinsic charge of the protein and hence the charge per unit length of a protein-SDS complex is essentially independent of molecular weight. That feature renders protein-SDS complexes essentially equal in electrophoretic mobility in a non-restrictive medium. If, however, the migration occurs in a sieving medium, such as a polyacrylamide gel, large (long) molecules will be retarded as compared to small (short) molecules, and a separation based approximately on molecular weight can be achieved. That is the principle of SDS electrophoresis as applied commonly to the analytical separation of proteins.

An important application of SDS electrophoresis involves the use of a slab-shaped electrophoresis gel as the second dimension of a two-dimensional procedure. The gel strip or cylinder in which the protein sample has been resolved by isoelectric focusing is placed along the slab gel edge and the molecules are separated in the slab, perpendicular to the prior separation, to yield a two-dimensional separation.

It is current practice to mold electrophoresis slab gels between two glass plates, and then to load sample and to run the slab gel still between the same glass plates. The gel is molded by introducing a dissolved mixture of polymerizable monomers, catalyst and initiator into the cavity defined by the plates and spacers or gaskets sealing three sides. Polymerization of the monomers then produces the desired gel medium. The gasket or form comprising the “bottom” of the molding cavity is removed after gel polymerization to allow current to pass through two opposite edges of the gel slab: one of the edges represents the open (top) surface of the gel cavity, and the other is formed against the removable bottom. Typically the gel is removed from the cassette defined by the glass plates after the electrophoresis separation has taken place, for purposes of staining, autoradiography etc., required for detection of resolved proteins.

The concentrations of polyacrylamide gels used in electrophoresis are generally stated in terms of % T (the total percentage of acrylamide in the gel by weight) and % C (the proportion of the total acrylamide that is accounted for by the crosslinker used). N,N′-methylenebisacrylamide (“bis”) is a typically used crosslinker.
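The two figures can be computed from a gel recipe as sketched below. The sketch assumes the common laboratory convention of expressing % T as grams of total monomer (acrylamide plus crosslinker) per 100 mL of gel solution, which approximates a percentage by weight for dilute aqueous gels; the function name and the example quantities are hypothetical.

```python
def gel_composition(acrylamide_g, bis_g, volume_ml):
    """Return (% T, % C) for a gel recipe.

    Assumes % T is grams of total monomer per 100 mL of solution
    and % C is the crosslinker share of that total monomer.
    """
    total_monomer_g = acrylamide_g + bis_g
    percent_t = 100.0 * total_monomer_g / volume_ml
    percent_c = 100.0 * bis_g / total_monomer_g
    return percent_t, percent_c


# Hypothetical recipe: 11.7 g acrylamide and 0.3 g bis in 100 mL.
percent_t, percent_c = gel_composition(11.7, 0.3, 100.0)
print(round(percent_t, 2), round(percent_c, 2))
```

The hypothetical recipe corresponds to a 12% T, 2.5% C gel.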

In most conventional systems of SDS electrophoresis, use is made of the stacking phenomenon. In a stacking system, an additional gel phase of high porosity is interposed between the separating gel and the sample. Further, the two gels initially contain a different mobile ion from the ion source (typically a liquid buffer reservoir) above the gels. Thus, the gels contain, for example, chloride (a high mobility ion) and the buffer reservoir contains, for example, glycine (a lower mobility ion, whose mobility is pH dependent).

All phases generally contain a known buffer, such as, Tris, as the low-mobility, pH determining buffer component and positive counter ion. Negatively charged protein-SDS complexes present in the sample are electrophoresed first through the stacking gel at a pH of approximately 6.8, where the complexes have the same mobility as the boundary between the leading (for example, Cl⁻) and trailing (for example, glycine⁻) ions. The proteins are thus “stacked” into a very thin zone sandwiched between the Cl⁻ and glycine⁻ zones.

As the stacking boundary reaches the top of the separating gel, the proteins become unstacked because at the higher separating gel pH (8.6), the protein-SDS complexes have a lower mobility. Thus in the separating gel, the proteins fall behind the stacking front and are separated from one another according to size as the proteins migrate through the sieving environment of the lower porosity (higher % T acrylamide) separating gel.

Running slab gels can take, for example, one of two modes. A gel in a cassette typically is mounted on a suitable electrophoresis apparatus so that one edge of the gel contacts a first buffer reservoir containing an electrode (typically a platinum wire) and the opposite gel edge contacts a second reservoir with a second electrode, steps being taken so that the current passing between the electrodes is confined to run mainly or exclusively through the gel. Such apparatus may be “vertical” in that the upper edge of the gel is in contact with an upper buffer reservoir and the lower edge is in contact with a lower reservoir, or the gel may be rotated 90° about an axis perpendicular to a plane, and the gel is run horizontally between a left and right buffer reservoir. Various other configurations have been devised to make the connections electrically and to simultaneously prevent liquid leakage from one reservoir to the other (around the gel).

When used as part of a typical 2-D procedure, an IEF gel is applied along one exposed edge of such a slab gel and the proteins within migrate into the slab gel under the influence of an applied electric field. The IEF gel may be equilibrated with solutions containing, for example, SDS, buffer and reducing agents, prior to placement on the SDS gel to ensure that the proteins in the IEF gel are prepared to migrate under optimal conditions. Alternatively, the equilibration may be performed in situ by surrounding the gel with a solution or gel containing the components after which the gel is placed in position along the edge of the sizing gel.

Gel electrophoresis to size proteins, and the various modifications to the basic materials and methods, has been described for example, in U.S. Pat. Nos. 4,169,036; 4,594,064; 4,839,016; 5,074,981; 5,209,831; 5,217,591; 5,275,710; and 5,306,404.

Because there may be limitations in the degree of resolution and discrimination of proteins in a gel, various manipulations can be implemented to optimize the information that can be obtained. For example, individual gels can be configured so that particular and more limited pH ranges are represented. Thus, a gel can contain a range of pH values from 7 through 14, or can contain a range of only three to four pH units that will provide greater separation within one pH unit.

For larger molecules, the configuration of the matrix can be modified to enable separation thereof. For example, a lower concentration of monomer resulting in a more porous gel can be used. In addition, gels of normal concentration and separation resolution can be used, but the proteins can be partially broken down by digestion to provide a subset of smaller component polypeptides. The artisan can develop such modifications based on the prevailing methodologies.

Some proteins may not be amenable to good separation and resolution in 2-D electrophoresis, for example, because of extreme hydrophobicity and/or insolubility in the detergents/solvents used in 2-D gels. Examples are the hydrophobic membrane proteins. In that event, alternative procedures are available. For example, the proteins can be treated repeatedly with a solution compatible with 2-D electrophoresis, such as, a buffer containing urea, NP-40, DTT and ampholytes. The insoluble proteins are removed, for example, by centrifugation and the supernatant collected.

Alternatively, an extraction can be performed using an organic solvent. The treated proteins then are applied to a suitable fractionation system, such as, SDS gel electrophoresis, with or without heating in SDS buffer, or chromatography in an organic solvent, such as methylene chloride or acetonitrile. The resulting separated proteins are quantified, for example, by optical absorbance, and then should be amenable for further analysis.

To visualize the separated proteins that normally form spots or smears of varying concentration based on molecular weight and charge, or are isolated at particular sites in the gel, the proteins are treated or are stained to be made detectable. For example, the proteins can be stained with a generalized dye that binds non-specifically to proteins, such as Coomassie Blue or a silver-based compound. Alternatively, negative staining can be practiced, for example by using a zinc salt that precipitates SDS in areas lacking protein. The reagents and methods are commercially available. Other protein stains are known in the art, such as fluorescent stains, SYPRO Red (Molecular Probes Corp., Oregon) and so on. Other detecting means include using antibodies, particularly labeled antibodies, to identify proteins. A single gel may be stained multiple times, with optional destaining procedures interspersed.

Thus, for example, in the case of positive protein staining, in a first tank, the gel is immersed up to the stacking gel in a solution comprising for example about 50% alcohol, such as ethanol, about 2% phosphoric acid and water for a period of about two hours to fix the proteins in place and to remove most of the buffer components, such as SDS, Tris and glycine, in the gel. Following fixation, the gel is moved to a tank containing, for example, about 28% methanol, about 14% ammonium sulfate and about 2% phosphoric acid in water and incubated for about two hours. Next, the gel is moved to a tank containing the same solution with the addition of powdered Coomassie Blue G250 dye, the whole liquid volume being circulated continually in the tank. The dye permeates the gel, binding to resolved protein spots. Finally, the gel is removed from that tank.

A feature of the instant invention is the detailed analysis of the molecular weight and isoelectric point (pI) of the protein. Individual gels are analyzed so that a detailed description of the discriminated proteins can be obtained. A suitable means to obtain such information is to have the information of each protein cataloged and stored in a data storage means. A computerized means for scanning, digitizing, processing, analyzing and storing the information is a preferred way for extracting that information and having the information available in a manner for ready comparisons. Thus, an electronic image of the stained gel is obtained. One example is scanning the gel. To maximize the information for each protein, a gel can be exposed to multiple subsequent staining procedures. Thus, for example, a low sensitivity stain, such as Coomassie Blue, can be followed by a stain of greater sensitivity, such as a silver stain. The scanning, analyzing and storing of information preferably occurs after each staining procedure.

Moreover, multiple sequential scans can be performed to obtain further information. Such information can yield enhanced precision and dynamic range for non-equilibrium stains, such as a silver stain. In such circumstances, the development process yields spots that stain intensely, moderately and at a very low level. By taking multiple sequential scans, spot quantification can be based on measurement parameters other than optical density, such as maximum rate of change of absorbance and time of onset of development. Also, proteins may be colored differently based on known or unknown reasons. In any event, any such distinction can serve as a diagnostic identifying parameter of a protein.
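The two kinetic parameters named above can be extracted from a series of time-lapse scans as sketched below. The sketch is not the method of the KEPLER™ software; the function name, the onset threshold and the sample readings are hypothetical and serve only to illustrate the calculation.

```python
def development_kinetics(times_s, absorbances, onset_threshold=0.05):
    """Return (maximum rate of change, time of onset) for one spot.

    times_s and absorbances are parallel lists from sequential scans.
    The onset is taken as the first scan at or above onset_threshold,
    which is an assumed definition.
    """
    rates = [
        (a1 - a0) / (t1 - t0)
        for (t0, a0), (t1, a1) in zip(
            zip(times_s, absorbances), zip(times_s[1:], absorbances[1:])
        )
    ]
    onset = next(
        (t for t, a in zip(times_s, absorbances) if a >= onset_threshold),
        None,  # the spot never developed above the threshold
    )
    return max(rates), onset


# Hypothetical readings for a single spot during silver development.
times = [0, 30, 60, 90, 120]
readings = [0.00, 0.01, 0.08, 0.30, 0.42]
max_rate, onset_time = development_kinetics(times, readings)
print(max_rate, onset_time)
```

An intensely staining spot would show an early onset and a high maximum rate, while a faint spot would show a late onset, so the two parameters can remain informative where the final optical density has saturated.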

A suitable means for obtaining the raw information for further data analysis would be to scan the pattern of discriminated proteins in a gel by an image processing means to yield a digitized image. Scanning can be performed by gently laying the gel on a horizontal, vertical or tilted illuminating table. An overhead digital camera, such as a CCD digitizer, then is used to acquire an image of the gel and the stained protein spots in absorbance mode. Alternative scanning modes may be practiced for measuring fluorescence or light scattering, depending on the stain used.

The data obtained from the scanning means then is transferred to a data inputting means and storage means for ordered archiving of the data relating to the individual proteins and spots. Scanned images of 2-D protein patterns can be subjected to an automated image analysis procedure using batch process computer software, such as the Kepler® system, that subtracts image background, and detects and quantifies spots. The final data for a 2-D gel, namely a series of records describing position and abundance for each spot, among other distinguishing features, then are inserted as records in a computerized relational database.
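Insertion of spot records into a relational database can be sketched with the SQLite module of the Python standard library. The table layout, column names and sample values are hypothetical and do not describe the schema of the Kepler database.

```python
import sqlite3


def store_spots(connection, gel_id, spots):
    """Insert one record per spot: position (x, y) and abundance."""
    connection.execute(
        """CREATE TABLE IF NOT EXISTS spot (
               gel_id    TEXT    NOT NULL,
               spot_no   INTEGER NOT NULL,
               x         REAL    NOT NULL,
               y         REAL    NOT NULL,
               abundance REAL    NOT NULL,
               PRIMARY KEY (gel_id, spot_no)
           )"""
    )
    connection.executemany(
        "INSERT INTO spot VALUES (?, ?, ?, ?, ?)",
        [(gel_id, n, x, y, a) for n, (x, y, a) in enumerate(spots, start=1)],
    )
    connection.commit()


# Hypothetical gel with two detected spots, held in an in-memory database.
connection = sqlite3.connect(":memory:")
store_spots(connection, "gel-001", [(12.5, 80.1, 1520.0), (45.0, 33.3, 310.5)])
rows = connection.execute(
    "SELECT spot_no, abundance FROM spot WHERE gel_id = ? "
    "ORDER BY abundance DESC",
    ("gel-001",),
).fetchall()
print(rows)
```

Keying each record on the gel and the spot number allows spots from many gels to be held in one table and compared by query.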

The storage of data and the comparisons between and among proteins are accomplished with a data processing means. A data storage means archives the data on each of the protein spots on a storage medium. The digitized data can be transformed, filtered, enhanced and so on to clarify the scanned plot of protein data and information provided for each protein or spot noted on the gels. The storage means that compiles and contains an ordered array of the protein information, such as the various parameters and characteristics thereof, can be any known means including a printed medium, such as a book or table, or a computer readable means, such as a compilation of data stored on a diskette, compact disc and so on.

One of the ways to index the proteins is to characterize each individual protein based on the properties thereof, such as molecular weight, isoelectric point (pI), tissue distribution and primary amino acid sequence.

Thus, a protein index of interest is one wherein each protein is characterized by at least three descriptive parameters thereof, namely pI, MW and expression tested in a variety of tissues, at least five tissues having been examined for expression thereof, as provided hereinabove. Moreover, the tissues can be obtained from a single individual of a panmictic population to control polymorphism and normal variation.
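An index entry carrying the three descriptive parameters, together with a check that at least five tissues have been examined, can be sketched as follows. The class name, field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field

MIN_TISSUES_EXAMINED = 5  # "at least five tissues", per the text


@dataclass
class IndexEntry:
    """One protein in the index (hypothetical structure)."""
    pi: float        # isoelectric point
    mw_kda: float    # molecular weight in kilodaltons
    # Maps each tissue examined to whether expression was detected.
    expression: dict = field(default_factory=dict)

    def qualifies(self):
        """True when enough tissues have been examined for expression."""
        return len(self.expression) >= MIN_TISSUES_EXAMINED


entry = IndexEntry(
    pi=5.6,
    mw_kda=42.0,
    expression={"liver": True, "kidney": True, "heart": False,
                "brain": False, "muscle": True},
)
print(entry.qualifies())
```

A tissue in which the protein was examined but not detected still counts toward the five, since the criterion concerns the tissues tested rather than the tissues expressing the protein.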

Another way to index the proteins is to characterize each spatially in the context of a gel pattern. While molecular weight and pI are determinative of the location of a protein spot on a gel, the relationship of any one protein spot to another spot or other spots on a gel can provide additional identifying parameters of the proteins. Frequently, identical proteins behave slightly differently in different samples to give a slightly different gel location. In addition, some variance may be observed in different batches of gels being run.

By aligning two patterns in a best fit (“spatial matching” or “warping”), spots that are shared by two samples and spots that appear to be unique to one or the other, in the absence of specific sequence data, may be revealed. Such pair-wise comparisons can be made over any combination of samples. The warping process to obtain a best fit of patterns comprises not only a static matching of gel patterns but also an electronic manipulation of patterns by, for example, stretching, rotating, shrinking and so on portions of one or both gels being compared to maximize the register of spots or landmark spots on the gels.
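The simplest case of such warping can be sketched as a single global affine transform (stretch, rotation, shear and shift) fitted by least squares to pairs of landmark spots. That is an assumption made for illustration; a practical warp would likely be local and non-linear. The function names `fit_affine` and `warp` are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("landmarks are collinear or too few")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def fit_affine(src, dst):
    """Least-squares affine map carrying src landmarks onto dst landmarks.

    src and dst are equal-length lists of (x, y) positions of the same
    landmark spots on two gels. Solves the normal equations A^T A p = A^T t.
    """
    rows = [(x, y, 1.0) for x, y in src]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    px = solve3(ata, [sum(r[i] * d[0] for r, d in zip(rows, dst)) for i in range(3)])
    py = solve3(ata, [sum(r[i] * d[1] for r, d in zip(rows, dst)) for i in range(3)])
    return px, py

def warp(point, params):
    """Apply the fitted transform to one spot position."""
    px, py = params
    x, y = point
    return (px[0] * x + px[1] * y + px[2], py[0] * x + py[1] * y + py[2])
```

Once fitted on the landmarks, `warp` can be applied to every remaining spot of the first gel before positions are compared with the second gel.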

A number of different measures, or combinations thereof, for determining distance or similarity of proteins or of spots can be employed. For example, suitable measures of distance and/or similarity for use with cluster analysis, multi-prototype classification and multidimensional scaling are Euclidean, average Euclidean, Mahalanobis, Minkowski, average Minkowski, maximum value, minimum value, absolute value, shape coefficient, cosine coefficient, Pearson correlation, rank correlation, Kendall's tau, Canberra, Bray-Curtis and Tanimoto, also known as Jaccard coefficient.
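A few of the listed measures can be sketched as follows for two spots described by equal-length feature vectors (for example, binned peak intensities). The choice of feature vector and the function names are assumptions made for illustration.

```python
import math

def minkowski(u, v, p=2.0):
    """Minkowski distance of order p; p=2 gives Euclidean, p=1 absolute value."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def euclidean(u, v):
    return minkowski(u, v, 2.0)

def canberra(u, v):
    """Canberra distance; terms where both entries are zero contribute nothing."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity; assumes the vectors are not both all zero."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(abs(a + b) for a, b in zip(u, v))

def cosine_coefficient(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```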

A comparing means is used to analyze spectra, or other identifying features, of the spots occurring on two or more 2-D gels. A similarity threshold may be selected to identify spots that could be the same. Alternatively, a more complex clustering threshold can be used. Denoted spots having similar spectra and that have similar positions (as judged by the X and Y positions of the spots on the 2-D gels after alignment by the imaging means) can be considered likely candidates for identity.

A large number of such pairs (in the case of a comparison of two gels) are analyzed by a comparing means as a group to yield a best fit and hence to derive a global geometrical mapping of a plurality of spots on a gel. That mapping forms a two dimensional spot pattern, which then forms the basis for a generalized matching wherein newly obtained spots are compared to those spots that comprise the standard pattern of proteins that have been characterized and already exist in the index.

Judicious choice of very diverse and very similar tissues could reduce the number of pair-wise comparisons that might need to be made. Having a scanning means and data storage means also would minimize the number of actual comparisons that need be made as a computer processing means can make those comparisons.

Thus, such a spatial analysis provides additional identifying parameters of a polypeptide comprising an index of interest.

Assignment of spots that are matched to a particular locus, site, address or cell on the reference 2-D gel can be validated, for example, by employing techniques providing additional information, such as fragment mass, detailed molecular weight information or sequence information as can be obtained, for example, using MS, LC/MS/MS or actual sequencing, of the proteins of interest. Other methods of determining identity of proteins between and among gels include binding by a specific ligand or co-factor, a receptor, a lectin or an antibody.

To obtain such additional information, a protein may be isolated from the 2-D gel matrix. A suitable technique is to isolate the individual protein spots and to extract and to purify the protein(s) from the matrix. That can be accomplished by known means and methods. A spot can be excised manually or robotically, based on scanning or previously obtained information contained in the index as to a protein's location in a warped 2-D gel, by means of a robotic spot cutter controlled by a processing means.

Then, the purified preparation of a protein or proteins with a particular molecular weight and pI is analyzed by another method of characterization, such as sequencing, immunologic identity, liquid chromatography or mass spectrometry (MS). There are methods of MS that are suitable for analysis of biomolecules, such as proteins. Some of those MS methods include matrix assisted laser desorption ionization (MALDI) MS, LC/MS/MS (liquid chromatography/tandem mass spectrometry) and MALDI-time of flight (TOF) MS. LC/MS/MS is particularly useful when analyzing hydrophobic proteins, such as membrane proteins, and for providing primary amino acid sequence data.

To conduct MALDI MS or MALDI-TOF MS, it may be necessary to take the proteins contained in a spot and to digest same to produce a collection of smaller oligopeptides as the smaller molecules are more amenable to separation and identification by those techniques. The means to obtain the oligopeptides are known and include mild hydrolysis by acid or base, digestion with particular proteases, peptidases, cyanogen bromide and so on. A number of oligopeptides from a single protein spot can be analyzed. A suitable size of the oligopeptides is on the order of about 5 amino acid residues to about 30 amino acid residues; however, those size limits are variable and can be dictated by the cleavage method and the level of discrimination afforded by any one particular analyzing means that is used. Thus, the mass spectrometry data provides information on the mass of peptide fragments of the polypeptide(s) comprising a spot.

MALDI MS data enables identification of the same protein on different 2-D gels. MALDI MS data can identify the parent protein in a sequence database search particularly when the oligopeptide is unique for the protein. Uniqueness is enhanced for proteins encoded by single copy genes or when the oligopeptide is larger.

LC/MS/MS provides additional information, particularly, actual amino acid content of a peptide. Each of the peptides is fragmented and the masses of the fragments are measured. In general, the peptides fragment at the peptide bonds. Thus, the fragments generated have masses differing by amino acid masses, which average about 100 daltons each. Therefore, by interpreting the fragment masses, it is possible to ascertain the amino acid sequence of the peptide. The result is a protein wherein the specific primary amino acid sequences of portions thereof are known.
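The interpretation of fragment masses can be sketched as reading a mass ladder: successive fragments of one ion series differ by one residue, so each mass difference is looked up in a table of residue masses. The sketch below assumes an idealized, complete ladder from a single ion series and uses standard monoisotopic residue masses; the function name and tolerance are illustrative.

```python
# Monoisotopic residue masses in daltons. Leucine and isoleucine are
# isobaric, so "L" stands for either residue.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def read_ladder(fragment_masses, tol=0.02):
    """Translate consecutive fragment mass differences into residues.

    Differences that match no residue within tol are reported as "?".
    """
    masses = sorted(fragment_masses)
    sequence = []
    for lower, upper in zip(masses, masses[1:]):
        diff = upper - lower
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - diff))
        sequence.append(best if abs(RESIDUE_MASS[best] - diff) <= tol else "?")
    return "".join(sequence)
```

Real spectra contain gaps, mixed ion series and noise, so a practical interpreter would need to handle missing rungs and score alternative readings.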

The MS peak data (essentially a table of the masses of the peptides obtained from each spot) also can be compared by a data processing and comparing means to obtain relationships between and among spots. That data can be manipulated to obtain relative spot:spot similarities. That exercise can obviate the need for the actual sequence of certain peptides.

The use of mass spectrometry (MS) and other protein identification methods to provide additional information on each protein spot facilitates the comparing, matching and collating of 2-D gel patterns into a coherent, all encompassing reference protein database that accounts for normal variation, tissue-specific differences, cellular differences and so on.

To assist in determining identity of proteins, the 2-D gel patterns of proteins from different sources can be compared. Therefore, the patterns of two gels are compared to determine which protein spots are held in common between and/or amongst the gels. That exercise also will reveal which protein spots vary and in what manner those proteins vary. By varying the source of the proteins, such a comparison also will reveal what is normal variation of a protein and whether a protein is specific for, for example, an organelle, a cell or a tissue.

To minimize polymorphism, particularly in the case of a randomly breeding population, tissues from an individual could be used. Thus, samples are obtained from a single genotype, thereby minimizing genetic variability imposed at the population level. Intraindividual variability should be revealed, such as between tissues or cells. Moreover, the information is obtained from primary tissues as compared to, for example, cell lines, which often are transformed in some fashion.

Another means for assisting in demonstrating similarity between two samples is to combine two protein sources to provide a mixture for separation in a gel. A gel containing the separated protein mixture is compared with the gel patterns of each protein source separated individually to obtain a spatial comparison. The mixtures can be at an even 1:1 ratio of the amounts of the two protein sources or can be in other predetermined ratios, for example, in a graded series of mixtures, such as 1:10, 1:2, 1:1, 2:1, 10:1, wherein the ratios represent the relative amounts of the two parental protein sources. Other ratios can be used. The various samples are separated by 2-D gel electrophoresis. The 1:1 mixture reveals spots specific for one or the other protein source. Then by comparing the gels of the graded mixtures, the change of a spot based on protein source can be observed. That exercise allows an assessment of spot identity with two sources. If the spot relocates in the graded mixtures, it is likely two distinct nearby spots would be seen in the gel of the 1:1 mixture.

By combining 2-D gel electrophoresis with a further protein identification means, such as mass spectrometry, it is possible to identify spots as likely to be the same on different gels, and thus, for example, originating from different organs, tissues, cells, organelles and so on. There may be spatial dissimilarity of the spots between and/or among gels. That can arise, for example, from experimental sources or natural sources. Experimental sources can be identified and minimized by refining techniques, such as consistency of materials and methods. Other sources of variation may be inherent in the molecules, such as allelic variation and so on. All such data are diagnostic.

Hence, the data will reveal the general location of a particular spot on a 2-D gel and therefore, spots can be aligned between and/or among gels despite variations in spot location on one or more gels.

Such identified spots can serve as landmarks for the warping procedure when comparing plural gels for a best fit. Warping can occur on 2-D gel patterns without further characterization of spots. However, further characterizing information lends confidence to the establishment of landmark spots. The further characterizing need not require total identity such as revealed by sequencing. Provisional identity can be obtained by immunological studies, other specific binding to cofactors, substrates, subunits, etc., partial sequencing, fragmenting the polypeptide and so on. For example, mass spectrometry, such as MALDI-TOF, would provide information on peptide fragment masses in a high throughput manner. The nature of fragmentation and the masses of the fragments can be diagnostic for a polypeptide residing in a spot.

By such identification, provisional or proven, of particular spots in various sites of a gel, the warping of gel images can be redone to account for a greater array of spots.

In addition, by such identification, it is possible to determine with confidence, without employing a particular protein identifying means, the identity of a spot on succeeding gels, if that spot localizes to an area where a known protein localizes. The accumulated data will provide a zone where an identified protein exists, even if that protein exhibits variability in different individuals, organs, tissues, cells and so on.

The value of such identification of particular spots on a gel, for example, by mass spectrometry, is that by selection of a subset of spots localized to various regions of a gel, only that subset need be identified to enable warping of gels to reveal spots of likely identity and those specific to a gel, and thus specific to the source of the proteins.

The identification of only a subset of landmark proteins or spots and warping enables a more rapid comparison of a plurality of gels and a provisional assignment of protein or spot identity in succeeding gels. Thus, a spot, not previously identified, that is found to reside at a particular location on a number of gels with or without warping, can be provisionally considered the same polypeptide or protein. That provisional assignment can be confirmed by a particular protein identification means, such as an immunoassay or mass spectrometry.

In addition, by identifying certain landmarks and warping, there no longer is a need to compare 2-D gel spot patterns that appear grossly similar. If the landmarks represent proteins found in a wide range of sources, and either the protein shows little or no variation or a confident level of variation is known, then the gel pattern of any new source can be compared to the reference gel pattern.

The greater the number of landmarks, the more exacting the warping process may be. However, at the onset, comparisons can be made with as few as 5 landmark spots. Preferably, there are more than 5 landmarks and with each provisional or proven assignment of spot identity, the landmark database is enhanced.

An outcome of the development of landmarks is a theoretical reference spot pattern containing the landmarks. Proteins of low variability will appear as discrete spots with sharp borders. Proteins more variable will be represented as a zone or region of location, the radius of the zone correlating to the amount of variability observed. That reference pattern may find use with the gel patterns of a wide range of protein sources.

Therefore, gels in which 90% or more of the spots are identical can be compared. But gels of lesser similarity can be compared by warping, such as gels with 80% or greater spot identity; gels with 70% or greater spot identity; gels with 60% or greater spot identity; gels with 50% or greater spot identity; gels with 40% or greater spot identity; gels with 30% or greater spot identity; or even gels which overtly appear dissimilar but for the landmark spots.

The spatial and additional spot characterization, such as MS data, enable relaxing the spatial stringency of the matching process by introducing additional identifying information for each peptide and each protein. The spatial and MS data also can reduce the number of tissue combinations that need to be performed to identify and to characterize a protein.

The storage means acquires the data so collected and catalogs said data for later analysis. A collating and comparing means operating on an individual protein can determine, for example, whether a spot revealed by one staining procedure is the same as another spot revealed by another staining procedure. That type of comparative analysis also will reveal whether different staining procedures, different gels, different gel separation procedures and the like, result in variation in the location of a protein based on molecular weight and pI on the 2-D gel.

The comparing means of MS data and spot matching can involve the step of comparing all spectra against each other according to some particular distance metric to yield a matrix of the similarity of each spot to all the other spots. Alternatively, the comparing means may independently, or in conjunction with the above, cluster the spots that are similar to one another. Ideally, clusters contain the same protein even when expressed in different tissues.
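A minimal sketch of the two steps described above, an all-against-all similarity matrix followed by clustering, is given below. It assumes single-linkage grouping at a fixed threshold, which is only one of many possible clustering rules; the function names and the threshold are illustrative.

```python
def similarity_matrix(items, sim):
    """Compare every item against every other using the similarity function sim."""
    n = len(items)
    return [[sim(items[i], items[j]) for j in range(n)] for i in range(n)]

def cluster(matrix, threshold):
    """Single-linkage clustering: join any two spots whose similarity
    meets the threshold, then return the connected groups of indices."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Any similarity function can be supplied as `sim`, for example a shared-peak fraction computed on rounded peptide masses.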

A preferred means for comparing and analyzing the data in the development of a protein index is to have the data obtained, stored, processed, analyzed, compared and so on in a form and manner that is compatible with a computer. Thus, for example, the data is archived in digitized form on a computer readable medium.

To know which protein spots are versions of other spots, even within the same tissue, MS, for example, can provide insight to that relationship by demonstrating that a series of several spots on a gel have the same peptide mass pattern.

Thus, the MS data (e.g., MALDI peptide masses) can be searched by a data comparing means to identify samples demonstrating similarity (of, for example, each spot of the gel to all other spots on the gel). The comparing means and data collation means will reveal clusters of spots that are likely (because of the similar peptides contained therein) to be versions of the same gene product.

Then each cluster is analyzed by a comparing means to select members having a very similar molecular weight, indicating that the selected proteins have the same or very similar polypeptide chain length and composition. The selected proteins then are analyzed further by a comparing means to determine if the pI separations between and among the proteins are consistent with differences amounting to integral charges, the most likely scenario if the proteins are simple chemical isoforms of one another.

The identification exercise can be facilitated if the protein is matched with a full-length gene sequence encoding the protein. The full-length gene sequence can be used to compute a theoretical pI of the deduced amino acid sequence and a delta pI/charge value for the deduced amino acid sequence. The position of the protein spots then can be compared to the theoretical pI to determine which, if any, is likely to correspond to the unmodified protein. The comparing means also can be used to compare the differences in the pI positions with the calculated delta pI/charge to determine whether the putative isoforms of the same molecular weight are likely to be single charge variants of one another, the most likely result in phosphorylated proteins.
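One way such a theoretical pI and delta pI/charge might be computed is sketched below, using the Henderson-Hasselbalch relation and a bisection search for the pH of zero net charge. The pKa values are illustrative textbook-style assumptions, not values given in this document, and modelling a modification as one added fixed negative charge is a simplification.

```python
# Illustrative side-chain and terminal pKa values (assumptions).
POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM, C_TERM = 9.0, 2.0

def net_charge(seq, ph):
    """Net charge of the deduced amino acid sequence at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM - ph))
    for aa in seq:
        if aa in POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE[aa]))
        elif aa in NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE[aa] - ph))
    return charge

def ph_at_charge(seq, target=0.0):
    """Bisection for the pH at which the net charge equals target.

    Net charge falls monotonically with pH, so bisection on [0, 14]
    converges; if the target is unreachable the result sits at a bound.
    """
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def theoretical_pi(seq):
    return ph_at_charge(seq, 0.0)

def delta_pi_per_charge(seq):
    """Shift in pI caused by one added fixed negative charge.

    With an extra -1 charge, neutrality is reached where the unmodified
    chain carries +1, so the shift is toward acidic pH (negative value).
    """
    return ph_at_charge(seq, 1.0) - ph_at_charge(seq, 0.0)
```

Observed pI spacings between spots of one cluster could then be compared with integer multiples of the value returned by `delta_pi_per_charge`.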

Members of a cluster can be analyzed further by a comparing means using quantitative data from various experiments to determine if there is an inverse variability between spots, which could be observed if the isoforms were transformed from one form to another by a modification process, or if there is coordinate variability between spots, which would be likely if all forms were increased or decreased together.

If a cluster contains one or more spots at the expected full length sequence position, and one or a small number of lower MW spots, then a comparing means can take the pI and MW of the smaller spots and compare those with the pI and MW predicted for various subsections of the full length sequence to determine if a subsection would be predicted to have the observed pI and MW. If so, some deductions may be possible regarding the nature of the process that results in production of the shorter product. For example, if the postulated fragment arises from putative alternate splice sites, then message splicing events are likely to be the cause of the differences. Alternatively, if the fragment has ends that are the likely cut sites of a specific protease, the characteristics of the protease may be deduced.

One may use a variety of ways to list the proteins in an orderly manner. An arbitrary alphanumeric descriptor can be assigned to the individual proteins. Alternatively, the proteins can be sorted by an individual parameter or characteristic, such as cell source, chromosome source, function, tissue source, pI, molecular weight, map coordinate position, some other name, symbol or acronym established from another list and so on. An artisan can select the criterion or criteria for ordering and selecting the proteins for ready accessibility.

A more complete description or definition of a protein will, therefore, contain an increasing set of descriptors, such as the molecular weight and pI data, as well as MS data and protein name, if known. A large number of distinguishing characteristics would enhance the reference value of the database. However, there may be for any one protein, a minimal set of unique defining characteristics that will be diagnostic for identifying that protein. That is true particularly for a provisional assignment of identity. Moreover, the identity of a polypeptide or spot is not necessary for entry of a protein into the database.

The index will serve as a reference resource providing identifying characteristics of the polypeptides so that any newly identified polypeptide can be compared to those already cataloged to determine either the identity of the newly identified polypeptide or the need to incorporate the newly identified polypeptide as a new entry of the index.

As discussed hereinabove, identified proteins will establish landmarks on 2-D gels that will enable warping and fitting of gels to correct for variation in the proteins and running conditions.

Therefore, in the context of spots on 2-D gels, there are a number of sets and subsets of protein spots depending on apparent identity between gels, based on, for example, pI, MW, tissue distribution, mass spectrometry data, primary sequence and so on.

A number of spots will be identical between the two gels. The identical proteins can be identified as comprising population or set W. A subset of proteins of set W will yield spots on the gels that overlap or appear to fall at the same site on the gels, once the gels are properly warped to ensure a best fit between the two gels. That subset of seemingly identical protein spots comprises a population or set X. A subset of proteins of set X of the two gels will have the same mass spectra. That subset can be identified as population or set Y. Finally, a subset of set Y comprises proteins that have identical spectra that match a theoretical spectrum based on the primary amino acid sequence of the protein. Those proteins comprise population or set Z. The proteins of set Z are those actually identified and are likely candidates as landmarks on 2-D gels. Proteins of subsets Y and Z, and perhaps subset X, once tested for expression in a variety of tissues, as provided hereinabove, are cataloged in the database.

The process for assigning a protein or a spot to one or more of the above sets, and for determining the correspondence of a protein or spot between two gels, may proceed along the following chain of events.

The spot patterns of the two gels are digitized by an image scanning means. The information collected includes, for example, the density, size and shape of the spot.

Spots that meet predefined criteria for characteristics of the spots, such as spot size, spot density, approximate pH, approximate molecular weight and so on, are excised from the gel by a spot extracting means so as to isolate the protein or proteins that comprise the spots.

The gel matrix is treated to enable extraction of the polypeptide(s) contained therein. Known methods are practiced.

The samples comprising one or more polypeptides are treated, such as with an enzyme, for example, a protease, such as trypsin, practicing known methods, to digest the polypeptide(s) into smaller peptide fragments.

The polypeptide fragments then are analyzed by mass spectrometry, such as MALDI or MALDI-TOF MS, to obtain mass spectra for the spot contents.

The mass spectrum of the individual spots is compared to that of known proteins provided in available databases using an algorithm such as MaldiMatch to organize data and to assign spots and proteins to population or set Z.

Then the data of the spots are compared between the two gels using an algorithm, such as MaldiMatch, at high stringency to identify proteins that comprise population or set Y. By high stringency is meant the parameters defining the search and analysis of data are configured to provide high sensitivity. For each spectrum, peaks are detected using known algorithms, such as RADARS, to yield a set of centroid m/z peaks that are reported in Daltons and relative intensity. Then the comparing algorithm, such as MaldiMatch, performs a dynamic calibration that entails rounding the molecular weight assignments for 10-20 of the most intense peaks of a spectrum to the nearest 1-2 Dalton units. Pairs of peaks of similar molecular weight are identified and the difference in high resolution mass is calculated. If a significant number of pairs are identified, a search is conducted to determine if a common mass difference, that is, an offset that affects all or a significant number of pairs of peaks, is present. Then, one or both of the spectra are modified by adjusting the peaks therein by the calculated offset or molecular weight difference. Then, the spectral similarity is calculated, where the similarity is a function of all mass peaks and the intensity thereof in either spectrum. Similarity values above an empirically derived threshold are considered matches. The threshold is one that is derived by conducting the above exercise for known proteins.
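The dynamic calibration step can be sketched as follows. This is not the MaldiMatch implementation, whose details are not given here; it is a minimal illustration under stated assumptions: peaks are paired when their masses lie within a fixed window, and the common offset is taken as the median of the pairwise mass differences. The function names, window, peak count and minimum pair count are hypothetical.

```python
from statistics import median

def common_offset(peaks_a, peaks_b, top_n=20, window=2.0, min_pairs=3):
    """Estimate a shared mass offset between two spectra.

    Each spectrum is a list of (mass, intensity) tuples. Only the top_n
    most intense peaks of each are paired. Returns 0.0 when too few
    pairs are found to support a calibration.
    """
    top_a = sorted(peaks_a, key=lambda p: p[1], reverse=True)[:top_n]
    top_b = sorted(peaks_b, key=lambda p: p[1], reverse=True)[:top_n]
    diffs = [
        mass_b - mass_a
        for mass_a, _ in top_a
        for mass_b, _ in top_b
        if abs(mass_b - mass_a) <= window
    ]
    if len(diffs) < min_pairs:
        return 0.0
    return median(diffs)

def recalibrate(peaks, offset):
    """Shift every peak mass of one spectrum by the estimated offset."""
    return [(mass + offset, intensity) for mass, intensity in peaks]
```

After `recalibrate` is applied to one spectrum, the similarity of the two spectra can be recomputed on the adjusted masses.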

The data of set Y are used as initial landmarks in an algorithm, such as Kepler, that conducts the initial image processing and analysis. The proteins of set Y comprise the landmarks to facilitate the warping of gel images to bring remaining spots into alignment in a best-fit accommodation.

Those spots of both gels not yet assigned to set Y that have similar positions following warping are tentatively assigned to population or set X.

Each pair of associated spots from the two gels is analyzed by mass spectrometry and spectrum matching as described hereinabove to confirm the tentative identity of the spots and the protein contained therein. The spectrum-matching algorithm, such as MaldiMatch, will be run at high specificity. Peaks are detected and reported in Daltons. Peak intensity also is recorded. That data comprises the peak list. All peaks are rounded to the nearest 1-2 Daltons to overcome calibration-related differences between identical samples. For each spot of one gel, the peak list thereof is compared to all peak lists for spots on the other gel. For a given comparison of peak lists, similarity is measured as a function of all the peaks present in both lists, as well as the intensity thereof. An empirically derived threshold is used to select candidate matches. The threshold is derived by comparing known proteins. Candidate matches are subjected to dynamic post acquisition calibration and the similarity is recalculated. An empirically derived cutoff is used to determine if the spots in question have the same protein constituents. The cutoff is derived from studies done with known proteins. That analysis detects true differences between spots and yields proteins or spots that comprise population X.

The data of proteins comprising population X then serve as landmarks in another iteration of the image analysis to again warp the gels. Spots on the gels found at the same position in the warped gels but not already assigned to set X are tentatively assigned to set W.

To confirm assignment of the proteins to the various sets, individual proteins can be further examined, such as by LC/MS/MS, to determine primary amino acid sequence for comparison, if available, to known sequences of known proteins.

In the above described spectrometry data comparison analysis, a variety of matching algorithms, such as Jaccard coefficient or weighted Jaccard coefficient, can be used. The Jaccard coefficient is the ratio of the number of peaks appearing in both spectra to the number of peaks appearing in at least one of the spectra.
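The Jaccard coefficient on two peak lists can be sketched as below. Masses are first rounded to bins so that small calibration differences do not prevent a match. The weighted form shown, the sum of the smaller intensities over the sum of the larger intensities per bin, is one common definition and is an assumption here, as are the function names and the bin width.

```python
def jaccard(masses_a, masses_b, bin_width=1.0):
    """Peaks in both spectra divided by peaks in at least one spectrum."""
    a = {round(m / bin_width) for m in masses_a}
    b = {round(m / bin_width) for m in masses_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_jaccard(spec_a, spec_b):
    """Weighted Jaccard on binned spectra given as {bin: intensity} dicts."""
    bins = set(spec_a) | set(spec_b)
    smaller = sum(min(spec_a.get(k, 0.0), spec_b.get(k, 0.0)) for k in bins)
    larger = sum(max(spec_a.get(k, 0.0), spec_b.get(k, 0.0)) for k in bins)
    return smaller / larger if larger else 0.0
```

For example, two spectra sharing two of four distinct binned peaks have a Jaccard coefficient of 0.5.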

When the data collation and comparisons are completed, the characterizing information for each polypeptide then is stored. The method of storage is variable and sorting can be based on any of a variety of the characteristics of the polypeptides. The database can contain entries for at least 10 polypeptides; at least 15; at least 20; at least 25; at least 30; at least 40; at least 50; at least 60; at least 70; at least 80; at least 90; at least 100 proteins. A database of interest is one wherein each of the polypeptides therein has been tested for expression in plural tissues as provided hereinabove. Thus, for example, each of 10 proteins has been tested for expression in at least 5; at least 6; at least 7; at least 8; at least 9; at least 10; at least 11; at least 12; at least 13; at least 14; at least 15; at least 16; at least 17; at least 18; at least 19; or at least 20 tissues. More than 20 tissues can be examined.

As discussed hereinabove, a suitable first step is to develop a database that accounts for the proteins of a number of different tissues. Preferably, the tissues are obtained from members of an inbred strain or an individual to minimize variation. The inbred strain can be of a microbe, plant or animal. The microbe, plant or animal can be wild, of agricultural significance (whether desired or pests) or for laboratory use. Suitable examples are agricultural livestock and crops, laboratory animals and so on. The database can include cellular and subcellular information. Populational variation can be quantified by studying samples from plural individuals of a population. It may be possible to make interspecies comparisons with samples obtained from the same tissue but from different species.

The index can provide a variety of uses beyond the identifying purposes. For example, the index can be used to reveal metabolic changes of an organelle, cell, tissue and so on under varying environmental conditions, such as, for example, temperature change, exposure to atypical states and environments, chemicals and so forth. For example, exposure to a particular biological inducer can result in expression of previously underexpressed or unexpressed proteins, loss of or lowered expression of certain proteins and variation in certain proteins. Other conditions include exposure to toxins or to pathogens. In addition, changes in protein expression can arise from a disease state or as a natural result of aging.

Finding proteins that arise in a disease state will enable the development of diagnostic assays, which may be 2-D gel electrophoresis together with other associated methodologies, such as mass spectrometry, but could also be other diagnostic means, such as a nucleic acid-based assay or an immunology-based assay, such as an ELISA, once a particular diagnostic protein is revealed.

Another source of proteins for study is cell lines that can be maintained in vitro for long periods of time. The protein index may provide a basis for selecting certain cell lines as being particularly, if not wholly, representative of a naturally occurring cell, tissue, organ or organism.

In a similar vein, the proteins of a biopsy specimen or primary cell, tissue or organ culture can be studied to monitor the status of the cells across multiple passages to ensure the culture remains useful for the intended purpose.

As discussed hereinabove, when spots and/or proteins diagnostic for the source of protein are identified, the actual diagnostic assay need not be 2-D gel electrophoresis or mass spectrometry, but can be any assay specific for that diagnostic protein, such as a specific binding assay, for example, an ELISA.

At some point in time, the initial protein characterization by, for example, 2-D gel electrophoresis, may be unnecessary and other methods may be employed to provide sufficient diagnostic information to provide a provisional, if not exact, identification of a protein.

For example, a particular protein may be available in pure form. That protein can be fragmented and the fragments examined by mass spectrometry to yield fragmentation pattern and fragment mass. That information may be diagnostic, thereby forgoing the need for 2-D gel electrophoresis. Such a 2-D gel bypass is not reliant solely on mass spectrometry, such as MALDI-TOF that is high throughput, but can be any method that reveals diagnostic information on the protein, provided that diagnostic information exists in the database.

The database of interest permits new analytical measurements other than the conventional “control vs. treated” experiment structures. The instant invention is directed at the analysis of multi-experiment databases. The methods provide better tests of the significance of observed changes, and allow the comparison of one set of changes with another for purposes of mechanism classification. Such a large-scale analysis of the effects of 50 different drugs has been done, including the identification of protein markers for efficacy and toxicity.

A second area of interest is in the comparison of various human tissue proteomes. The tissue-to-tissue similarities and differences observed in the practice of the instant invention provide insights into the relationship between structure and function at the organismal level, as well as in the process of development.

By measuring the abundance of every or at least a very large number of proteins in a particular tissue, cell type or fraction from a statistically significant number of individuals, one can prepare a distribution of amounts for each protein. Using statistical analysis, such as a cutoff of 2 or 3 standard deviations from the mean, one can state that certain proteins are higher or lower in abundance in certain individuals. If those individuals are unique in any manner, such as having a disease, one may suspect the protein(s) are markers for the disease and perhaps are involved in the disease mechanism in some fashion. The association-based hypothesis is then provable by later experiments.
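The standard deviation test described above can be sketched for a single protein as follows. The input layout (a mapping of individual to abundance) and the function name are assumptions made for illustration, and the sketch presumes the abundances are roughly normally distributed.

```python
from statistics import mean, stdev

def flag_outliers(abundances, n_sd=2.0):
    """Flag individuals whose abundance of one protein lies more than
    n_sd sample standard deviations from the mean.

    abundances maps an individual identifier to a measured abundance.
    Returns {individual: "high" or "low"} for the flagged individuals.
    """
    values = list(abundances.values())
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return {}
    return {
        who: ("high" if value > mu else "low")
        for who, value in abundances.items()
        if abs(value - mu) > n_sd * sd
    }
```

Running the same test over every protein in the database yields, for each individual, the list of proteins that are unusually high or low.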

By observing when certain combinations of proteins appear simultaneously or antagonistically, such as when the expression or appearance of one can predict the expression or appearance of one or more other proteins, the expression of the two or more proteins may be correlated, either positively or negatively. That implies that the genetic control of those proteins may be co-regulated in some manner. It is also likely that some combinations of co-regulated proteins represent at least part of a metabolic pathway.

For example, 80 pairs of monozygotic twins were selected for maximal disease phenotype discordance. The within-pair differences are indicative of pure non-genetic disease phenotype effects. That was done to reduce background noise due to polymorphisms. Within-pair correlations were made.

A master spot pattern of 970 spots was generated for 32 twin pairs, see FIG. 3. Spot to spot correlations across the subjects were performed to detect apparently co-regulated proteins. A 118 spot subpattern classified 64 subjects into pairs with 88% accuracy. The results are given in FIGS. 4-6 with lines between spots indicating proteins that appear to be co-regulated by virtue of a correlated pattern of expression. The number of correlations suggests that metabolism is considerably more complex than previously thought.
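A spot to spot correlation across subjects can be illustrated with a short sketch. The specification does not state which correlation statistic or cutoff was used; the Pearson coefficient, the 0.9 cutoff, and the names `pearson` and `correlated_pairs` below are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant spot carries no correlation information
    return sxy / sqrt(sxx * syy)

def correlated_pairs(spots, threshold=0.9):
    """List (spot_a, spot_b, r) for every pair with |r| >= threshold.

    spots: dict mapping a spot identifier to its abundances, one value per
    subject, in the same subject order for every spot.
    """
    pairs = []
    for a, b in combinations(sorted(spots), 2):
        r = pearson(spots[a], spots[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical abundances of four spots measured in four subjects.
demo = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8],
        "C": [4, 3, 2, 1], "D": [1, 3, 2, 3]}
links = correlated_pairs(demo)
```

Each returned pair corresponds to one line that could be drawn between two spots; a positive r indicates proteins that rise together, and a negative r indicates antagonistic behavior.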

A complete Human Protein Index (HPI) would mark the completion of human protein molecular anatomy, with each protein described, all stages in the maturation and transport thereof described, and the mature place of the protein in cellular molecular anatomy known. Fortunately, the same technologies and processes required for the HPI are those required to explore development, cell function and disease states at the molecular level.

One of the most basic questions in biology concerns the mechanisms and program underlying differentiation. Differentiation can be viewed as a progressive diminution of gene expression in a cell as various genetic programs are relegated to non-expression. Metaplasia, dedifferentiation and redifferentiation are other manifestations of the basic theme, albeit at lesser occurrence. In those circumstances, the exception occurs and quiescent genetic programs are once again active or may never have been silenced.

Many theoretical approaches have been formulated to describe how differentiation operates. Those almost invariably postulate the existence of sets or batteries of genes that are switched on or off together, and that are organized to be expressed in a prearranged sequence. In the simplest case, one set of protein gene products would contain a derepressor activating a second set, while the second set would contain a repressor for the first and a derepressor for a third. Such a chain of events could be irreversible.

While many examples of coregulation of gene expression are known, no protein database or index contains definitive examples. Further, there is disagreement as to whether the organization of the genome operating system is such that relatively few co-regulated sets exist, or whether, as has been proposed, all proteins are part of an interconnected signaling network in which the presence, absence, or change in abundance of any one protein causes changes in the abundance of many others.

Many of those questions can be approached by selectively analyzing the data obtained in the practice of the instant invention. One can sort the data to reveal proteins that are found in all nucleated somatic human cell types, and hence may be assumed to be part of the general housekeeping systems. Others may be unique to a stage in the cell cycle, to one or a few cell types, to certain stages in differentiation, or to cells derived from one germ layer. The problem of coregulated sets may be approached by asking which proteins are always either expressed together or absent together, i.e., if one, then all; if not one, then none.
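The two sorting operations just described can be sketched on presence/absence data. This is an illustrative sketch only, assuming each cell type or sample has already been reduced to a set of detected protein identifiers; the function names and the sample and protein labels are hypothetical.

```python
from collections import defaultdict

def housekeeping(samples):
    """Proteins detected in every sample (candidate housekeeping set)."""
    sets = list(samples.values())
    return set.intersection(*sets) if sets else set()

def coexpressed_groups(samples):
    """Groups of two or more proteins with identical presence profiles,
    i.e. "if one, then all; if not one, then none".

    Proteins present in every sample are left out, since they belong to
    the housekeeping set rather than to a switchable set.
    """
    profile = defaultdict(set)  # protein -> samples in which it appears
    for name, proteins in samples.items():
        for p in proteins:
            profile[p].add(name)
    groups = defaultdict(set)   # presence profile -> proteins sharing it
    for p, where in profile.items():
        groups[frozenset(where)].add(p)
    n = len(samples)
    return [g for where, g in groups.items() if len(g) > 1 and len(where) < n]

# Hypothetical detections in three cell types.
cells = {
    "liver": {"P1", "P2", "P3", "P4"},
    "kidney": {"P1", "P2", "P3"},
    "brain": {"P1", "P5"},
}
common = housekeeping(cells)
groups = coexpressed_groups(cells)
```

Groups found this way are only candidates. With few samples, identical presence profiles can arise by chance, which is why confirmation by perturbation experiments is still needed.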

Some genes may not be switched off at any time and may be part of a basic housekeeping set. Computerized searching of the data contained in the HPI allows both candidate co-regulated sets and the set of basic housekeeping proteins to be identified. Confirmation of a set identification may be made by using inhibitors that up or down regulate one member of a putative set, to see if other presumed members are similarly affected.

Instances are known where introduction of an inhibitor of one member of a co-regulated set produces up regulation of that member, a concomitant decrease in the biochemical activity of the factor, and coordinated up regulation of another member of the set. That mechanism, termed a “carom shot”, is the only currently known technique for up regulating expression of a particular gene. Hence, the identification of members of coregulated sets is of great pharmacological significance.

Since many proteins have diagnostic significance, there is also a need for detecting and quantitating defined sets of proteins in body fluids and tissue samples, using simple and ultimately inexpensive methods analogous to DNA chips. Protein chips that carry a wide array of distinct proteins can be made and used for screening and diagnostic purposes, see for example, U.S. Ser. Nos. 482,460 and 628,339.

EXAMPLE Preparation of the Human Protein Index

A single female donor who died of cardiac arrest was dissected, with dissection begun within hours of death and finished within 24 hours after death. 149 tissues were recovered and snap frozen in liquid nitrogen. Two male donors were dissected within 4 hours of death and 8 tissues were recovered in the same manner to obtain male specific tissues.

Samples were prepared by solubilization of frozen tissue. Once the tissue was solubilized, the resulting protein sample was stored at −80° C. until thawed for 2-DG analysis. Briefly, this protocol involves homogenizing a small weighed piece of tissue in an eight-fold excess (weight/volume) of 4% IGEPAL CA630, 9 M urea (analytical grade, e.g. BDH or BioRad), 1% dithiothreitol (DTT; Gallard Schlesinger) and 2% ampholytes (pH 8.0-10.5; BDH).

Sample proteins were resolved by 2-DG electrophoresis using the LSP ProGEx system. All first dimension isoelectric focusing gels were prepared using the same single standardized batch of ampholytes (BDH pH 4.0-8.0) selected by previous batch testing. Eight to thirty microliters of solubilized protein were applied to each gel and the gels were run in groups of 25 for 25,050 volt-hours using a progressively increasing voltage protocol implemented by a programmable high voltage power supply.

An Angelique™ computer-controlled gradient casting system was used to prepare second dimension SDS gradient slab gels in which the top 5% of the gel was 8% T acrylamide, and the lower 95% of the gel varied linearly from 8% to 15% T. Each gel was identified by a computer-printed filter paper label polymerized into the gel. First dimension IEF tube gels were loaded directly onto the slab gels with a brief equilibration in 9 mM dithiothreitol (DTT; Gallard Schlesinger), 125 mM Tris pH 7.0 (Sigma), 2% SDS (J. T. Baker), 10% Glycerol (BDH), and trace bromophenol blue. Equilibration buffer was removed and tube gels were held in place by hot agarose. Second dimension slab gels were run in groups of 25 for 1,280 volt-hours in thermal-regulated (20° C.) DALT tanks with buffer circulation. Following SDS electrophoresis, slab gels were stained for protein using either a colloidal Coomassie Blue G-250 procedure or silver staining.

The Coomassie Blue G-250 staining procedure is performed in covered plastic boxes, with 12-13 gels per box, and involves fixation in 1.8-1.9 liters of 50% ethanol/3% phosphoric acid overnight, three 30 minute washes in 2 liters of cold deionized water, and transfer to 1.8-1.9 liters of 34% methanol/17% ammonium sulfate/3% phosphoric acid for one hour followed by addition of a gram of powdered Coomassie Blue G-250 stain. Staining requires approximately 4 days to reach equilibrium intensity. Stained slab gels were scanned and digitized in red light at 133 micron resolution, using an Eikonix 1412 scanner, and images were processed using the Kepler® software system.

For silver staining, gels were fixed in 1.8-1.9 L of 50% ethanol/3% phosphoric acid for 4 hours and then washed in DI water for 1 hour. The gels were then clipped onto a gel hanger and processed through the fully automatic Argentron™ silver stainer. The individual steps include agitation for 30 seconds in deionized water, one minute in 0.44 g sodium thiosulfate in 2 L DI water, 10 seconds in deionized water, 30 minutes in 4.6 g silver nitrate in 2 L DI water and 0.78 ml 37% formaldehyde, 10 second DI water wash, 20 minutes in 66 g potassium carbonate, 0.033 g potassium thiosulfate in 2 L deionized water with 0.78 ml of 37% formaldehyde. Images are taken at 30 second intervals and the development is stopped in 88 g tris (hydroxymethyl) aminomethane in 2 L deionized water and 44 ml glacial acetic acid.

For protein identification by mass spectrometry, gel pieces containing the proteins of interest were automatically excised from Coomassie stained gels and placed in 96-well polypropylene microtiter plates. Samples were in-gel digested with trypsin according to the procedure of Shevchenko, et al., Analytical Chemistry 68: 850-858 (1996), with slight modifications. Briefly, the excised samples were destained by two 60 min cycles of slight shaking in 200 μL of 0.1 M NH₄HCO₃ in 50% CH₃CN with the resulting solution aspirated after each cycle. Reduction was accomplished by adding 40 μL of 10 mM DTT in 0.1 M NH₄HCO₃ and incubating at 37° C. for 45 min. After cooling to room temperature, samples were alkylated by adding 40 μL of 55 mM iodoacetamide in 0.1 M NH₄HCO₃ and incubated at room temperature in the dark for 30 min. The supernatant was removed and 100 μL of 100% CH₃CN was added to each sample. After 10 minutes the CH₃CN was removed and the gel pieces dried for 30 minutes in a Speed-Vac concentrator. To each gel sample, 4 μL of 12.5 μg/μL modified Trypsin (Promega) was added, the plates sealed, and incubated at room temperature overnight. Trypsin was prepared in either 3 mM Tris (pH 8.4) or 10 mM NH₄HCO₃ (pH 8.8), depending upon the selection of MALDI matrix. Extraction of the proteolytic peptide fragments from the gel pieces was accomplished by adding 8 μl of 0.1% TFA in 50% CH₃CN, followed by slight shaking for 15 minutes.

All samples were prepared using one of two protocols employing a 96-tip liquid handling robot (Model CyBi-Well, CyBio AG, Jena, Germany). The first protocol entails the use of 2,5-dihydroxybenzoic acid (DHB) as the MALDI matrix utilizing a modified version of the dried droplet method, Karas et al, Analytical Chemistry 60: 2299-2301 (1988). The samples were prepared on either 400 μm AnchorChip™ targets or 600 μm AnchorChip™ targets manufactured by Bruker Daltonics. The DHB matrix solution (4 g/L) was applied first to the anchor target (0.6 μl for 400 μm anchors; 1.2 μl for 600 μm anchors) and allowed to air evaporate. The peptide solutions that were previously prepared in a Tris buffer (0.6 μl for 400 μm anchor targets; 1.2 μl for 600 μm anchor targets) were deposited onto the anchors containing the dried DHB matrix. The MALDI sample was allowed to air evaporate. The second protocol employs α-cyano-4-hydroxycinnamic acid as the MALDI matrix utilizing a modified dried droplet method, Karas et al, Analytical Chemistry 60: 2299-2301 (1988), employing 600 μm AnchorChip™ targets. The matrix solution was prepared by dissolving α-cyano-4-hydroxycinnamic acid in acetone at a concentration of 1 g/L. This matrix solution was diluted 2:1 with ethanol for a final matrix concentration of 0.33 g/L. The peptide solutions previously prepared in an ammonium bicarbonate buffer (0.6 μl) were applied first to the 600 μm anchors, then 1.7 μl of matrix solution, and the sample was allowed to air evaporate. The dried MALDI samples were washed by dispensing 7 μl of 1% trifluoroacetic acid, allowing the wash solution to remain on the MALDI sample for approximately 15 seconds. The entire volume of wash solution was aspirated and the sample air dried. The MALDI sample was recrystallized by dispensing 0.5 μl of 6:3:1 ethanol:acetone:1% trifluoroacetic acid onto the washed samples and allowed to air evaporate.

MALDI experiments were performed on Bruker BiFlex III time-of-flight mass spectrometers (2.0 m linear flight path) equipped with delayed ion extraction. A pulsed nitrogen laser (Model VSL-337i, Laser Science, Franklin, Mass.) at 337.1 nm (<4 ns FWHM pulse width) was used for all of the data acquisition. Data was acquired in the delayed ion extraction mode using a 19 kV bias potential, a 4.1 kV pulse and a 30 ns pulsed delay time. Dual microchannel plate (Model 1332-4505 Galileo Electro-Optics, Sturbridge, Mass.) detection was utilized in the reflector mode with the ion signal recorded using a 2-GHz transient digitizer (LeCroy LSA 1000 series, Chestnut Ridge, N.Y.) at a rate of 2 GS/s. All mass spectra represent signal averaging of 100 laser pulses. The mass spectrometer provided sufficient mass resolution to resolve the isotopic multiplet for each ion species below a mass-to-charge ratio (m/z) of 3500. The data was analyzed using MoverZ (ProteoMetrics, LLC, New York, N.Y.).

All MALDI mass spectra were internally calibrated using masses from two trypsin autolysis products (monoisotopic masses 841.50 and 2210.10). Mass spectral peaks were determined based on a signal-to-noise (S/N) of 2. Three software packages, Protein Prospector, Profound and Mascot, were used to identify protein spots. The human protein database consisting of SwissProt entries was used in the searches. Parameters used in the searches included proteins less than 200 kDa, greater than 4 matching peptides and mass errors less than 50 ppm.
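The three search parameters named above (protein mass below 200 kDa, more than 4 matching peptides, mass error below 50 ppm) can be expressed as a simple filter. The sketch below is not the scoring used by Protein Prospector, Profound or Mascot; it only illustrates how a parts-per-million tolerance and a minimum match count combine. All function names and the example masses are hypothetical.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def count_matches(observed_masses, theoretical_masses, tol_ppm=50.0):
    """Count theoretical peptide masses matched by at least one observed
    peak within the ppm tolerance."""
    matched = 0
    for t in theoretical_masses:
        if any(abs(ppm_error(o, t)) < tol_ppm for o in observed_masses):
            matched += 1
    return matched

def accept(protein_kda, observed, theoretical,
           min_matches=5, max_kda=200.0, tol_ppm=50.0):
    """Apply the three criteria: mass < 200 kDa, more than 4 matching
    peptides (i.e. at least 5), and mass errors < 50 ppm."""
    if protein_kda >= max_kda:
        return False
    return count_matches(observed, theoretical, tol_ppm) >= min_matches

# Hypothetical peak list and theoretical tryptic peptide masses.
theoretical = [1000.0, 1200.0, 1500.0, 1800.0, 2000.0, 2500.0]
observed = [1000.02, 1200.03, 1500.05, 1800.05, 2000.08, 2500.5]
n_matched = count_matches(observed, theoretical)
```

Here five of the six theoretical masses are matched; the sixth peak is off by about 200 ppm and is rejected.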

A home-built microelectrospray interface similar to an interface described by Gatlin et al, Analytical Biochemistry 263: 93-101 (1998) was employed. Briefly, the interface utilizes a PEEK micro-tee (Upchurch Scientific, Oak Harbor, Wash.) into one stem of which is inserted a 0.025″ gold wire to supply the electrical connection. Spray voltage was 1.8 kV. A microcapillary column was prepared by packing 10 μm MAGIC C18 particles (Michrom BioResources, Auburn, Calif.) to a depth of 10 cm into a 75×360 μm fused silica capillary PicoTip (New Objectives, Cambridge, Mass.). A 50-70 μl/min flow from a MAGIC 2002 HPLC solvent delivery system (Michrom BioResources) was reduced using a splitting tee to achieve a column flow rate of 350-450 nl/min.

Samples were loaded on-column utilizing an Alcott model 718 autosampler (Alcott Chromatography, Norcross, Ga.). HPLC flow was split prior to sample loop injection. Samples prepared for MALDI were diluted 1:3 in 0.5% HOAc, and 2 μl of each sample was injected on-column. Using contact closures, the HPLC triggered the autosampler to make an injection and, after a set delay time, triggered the mass spectrometer to start data collection.

A 12 min gradient of 5-55% solvent B (A: 2% ACN/0.5% HOAc, B: 90% ACN/0.5% HOAc) was selected for separation of trypsin digested peptides. Peptide analyses were performed on a Finnigan LCQ ion trap mass spectrometer (Finnigan MAT, San Jose, Calif.). The heated desolvation capillary was set at 150° C., and the electron multiplier at −900 V. Spectra were acquired in automated MS/MS mode with a relative collision energy (RCE) preset to 35%. To maximize data acquisition efficiency, the additional parameters of dynamic exclusion, isotopic exclusion and “top 3 ions” were incorporated into the auto-MS/MS procedure. For the “top 3 ions” parameter, an MS spectrum was taken followed by 3 MS/MS spectra corresponding to the 3 most abundant ions above threshold in the full scan. This cycle was repeated throughout the acquisition. The scan range for MS mode was set at m/z 375-1200. A parent ion default charge state of +2 was used to calculate the scan range for acquiring tandem MS.
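The “top 3 ions” selection with dynamic exclusion can be sketched as follows. The instrument control software is not described in the specification, so this is only an illustration of the selection rule; the function name `select_precursors`, the 0.5 m/z exclusion tolerance, the threshold value and the peak list are all assumptions.

```python
def select_precursors(scan, threshold, excluded=(), top_n=3,
                      mz_range=(375.0, 1200.0), tol=0.5):
    """Pick up to top_n precursor m/z values from a full MS scan.

    scan: list of (mz, intensity) peaks from the full scan.
    threshold: minimum intensity for a peak to be considered.
    excluded: m/z values recently fragmented (dynamic exclusion list).
    tol: assumed m/z window around each excluded value.
    """
    lo, hi = mz_range
    candidates = [
        (mz, inten) for mz, inten in scan
        if lo <= mz <= hi
        and inten > threshold
        and not any(abs(mz - ex) <= tol for ex in excluded)
    ]
    # Most abundant ions first.
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]

# Hypothetical full scan.
full_scan = [(400.2, 500), (650.4, 9000), (800.1, 7000),
             (1100.7, 3000), (1300.0, 20000), (500.5, 50)]
first_cycle = select_precursors(full_scan, threshold=100)
second_cycle = select_precursors(full_scan, threshold=100, excluded=[650.4])
```

In the second cycle the most abundant ion is skipped because it was already fragmented, so a less abundant ion gets its turn. This is the purpose of dynamic exclusion.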

Automated analysis of LCQ peptide tandem mass spectra was performed using the computer algorithms SEQUEST (Finnigan MAT, San Jose, Calif.) and/or Mascot (Matrix Science Ltd, London, UK). The non-redundant (NR) protein database was obtained as an ASCII text file in FASTA format from the National Center for Biotechnology Information (NCBI). A specific rat protein database was created by selecting rat protein sequences from the NR database. This database subset was used for subsequent searches. Protein identifications were based on obtaining good quality MS/MS spectra from a minimum of two unique tryptic peptides.

1570 gels (10 per tissue) were run for developing the respective tissue master patterns. 640 2-D gels were run for MS analysis. 776 2-D gels were run for co-electrophoresis using the methods described above to warp images between two different gels representing different tissue master patterns. A large number of 2-D gels were run for various other purposes related to the generation of the HPI.

115,693 proteins were isolated, detected and quantified from these 2-D gels.

Images from different tissues were warped with key landmark proteins identified by mass spectrometry as mentioned above.

A very large number of protein spots were characterized in detail by MALDI and Electrospray MS/MS. Many do not correspond to any known protein upon searching the various protein databases mentioned above and are identified by accession numbers, source and physical properties. 2741 protein spots from Master Patterns from this study were identified and corresponded to known proteins. As many of these proteins are the same but found in different tissues, 446 different unique named proteins were confirmed. Another 400 proteins were identified and correspond to known proteins when compared to previously developed master spot patterns. Confirmed proteins which were not previously identified were not counted above.

Extrapolating from the percentage of proteins which are and are not tissue specific, and previously identified vs. newly identified by this experiment, the database generated is believed to cover approximately 18,000 unique “gene products”. This does not count “different” proteins that differ by post-transcription modification and are slightly different chemically.

Tissue specific proteins were determined by subtracting proteins found in more than one tissue from the lists of proteins found in each tissue. Tissue specific proteins are useful for determining the origin of a tissue throughout embryonic development, and for determining the tissue origin of a tumor to establish whether it is a primary tumor or a metastasis and thereby deducing appropriate therapy. They are also useful for measuring the effects of trauma, disease, and various physical and chemical agents on different tissues, by detecting tissue specific proteins in various body fluids, tissue samples or organs and washings therefrom as a measure of tissue damage. These determinations aid in finding which tissues are affected, the extent of damage in each, and in monitoring the viability of organs and tissues for transplant both prior to removal and after transport outside the body.
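The subtraction step in the first sentence of the preceding paragraph can be illustrated in a few lines. This is a sketch only, assuming each tissue has been reduced to a set of protein identifiers; the function name and the tissue and protein labels are hypothetical.

```python
from collections import Counter

def tissue_specific(tissues):
    """For each tissue, keep only proteins found in no other tissue.

    tissues: dict mapping a tissue name to the set of protein identifiers
    detected in that tissue.
    """
    # Number of tissues in which each protein was detected.
    counts = Counter(p for proteins in tissues.values() for p in set(proteins))
    return {
        tissue: {p for p in proteins if counts[p] == 1}
        for tissue, proteins in tissues.items()
    }

# Hypothetical protein lists for three tissues.
lists = {
    "liver": {"P1", "P2", "P3"},
    "heart": {"P1", "P4"},
    "brain": {"P1", "P2", "P5"},
}
specific = tissue_specific(lists)
```

A protein that appears specific here is specific only relative to the tissues included in the comparison; adding further tissues can only shrink each set.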

All references cited herein are herein incorporated by reference in entirety.

It will be evident to the artisan that various changes and modifications can be made to the teachings herein without departing from the spirit and scope of the invention of interest.

1-29. (canceled)
 30. An ordered set of elements comprising at least N elements, wherein each of said N elements is a polypeptide or a protein, wherein presence or absence of each of said N elements is determined in at least 5 tissues from a single subject; each of said elements is analyzed by mass spectrometry and N is at least 10.
 31. The set of claim 30, wherein said set comprises at least 20 elements.
 32. The set of claim 30, wherein said polypeptide is of unknown function.
 33. The set of claim 30, wherein expression of said elements is tested in at least 7 tissues.
 34. The set of claim 30, wherein an element is characterized further by having a molecular weight value.
 35. The set of claim 30, wherein an element is characterized further by having an isoelectric point.
 36. The set of claim 30, wherein said subject is a human.
 37. The set of claim 30, wherein an element is characterized further by a cell of origin.
 38. The set of claim 30, wherein an element is characterized further by an organelle of origin.
 39. The set of claim 30, wherein said ordered set of elements is contained in a machine-readable storage medium.
 40. A machine readable storage medium comprising digitized data of an ordered array of N elements, wherein said N elements are proteins; and wherein said digitized data comprises expression of each of said N elements in at least 5 tissues of a single subject and a mass spectrometry scan of each of said elements; and N is at least 10.
 41. The medium of claim 40, comprising expression in at least 7 tissues.
 42. The medium of claim 41, comprising expression in at least 9 tissues.
 43. The medium of claim 42, comprising expression in at least 11 tissues.
 44. The medium of claim 40, wherein N is at least 20.
 45. The medium of claim 44, wherein N is at least 30.
 46. The medium of claim 45, wherein N is at least 40.
 47. The medium of claim 46, wherein N is at least 50.
 48-70. (canceled)